In [59]:
import pandas as pd
pd.set_option('display.max_colwidth', 0) #To display entire text content of a column
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'
from sklearn.model_selection import train_test_split
import os
import json

In [60]:
os.chdir(r'D:\capstone_data')  # raw string so the backslash is not read as an escape

The final dataset for our model will be created from 3 separate datasets:
1. Sentiment 140 dataset
2. Consumer complaints
3. Amazon reviews

# Sentiment 140

The Sentiment140 dataset comprises approximately 1.6 million tweets with sentiment labels. A label of 0 indicates negative sentiment, 2 indicates neutral and 4 indicates positive sentiment. The training tweets were labelled automatically from the emoticons they contained, e.g. a tweet with :( is considered negative. More details on the dataset can be found here: http://help.sentiment140.com/for-students/

In [61]:
#Read 
sentiment_140_df_train = pd.read_csv('trainingandtestdata/training.1600000.processed.noemoticon.csv',
                                    encoding = "ISO-8859-1", header = None)

In [62]:
#Preview
sentiment_140_df_train.tail()

Unnamed: 0,0,1,2,3,4,5
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best feeling ever
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interviews! â« http://blip.fm/~8bmta
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me for details
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! Tupac Amaru Shakur
1599999,4,2193602129,Tue Jun 16 08:40:50 PDT 2009,NO_QUERY,RyanTrevMorris,happy #charitytuesday @theNSPCC @SparksCharity @SpeakingUpH4H


We only need the tweet text (column 5) and the labels (column 0). The rest of the columns will be dropped.

In [64]:
#Drop Columns
sentiment_140_df_tweet_label_only = sentiment_140_df_train[[5,0]]
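
As an alternative to selecting columns after loading, the same two columns could be requested at read time. This is only a sketch on two made-up rows: the column names `id`, `date`, `query` and `user` are assumptions chosen for illustration, and the in-memory CSV stands in for the real training file.

```python
import io
import pandas as pd

# Two made-up rows in the six-column, header-less layout of the training file.
sample_csv = io.StringIO(
    '0,1,"Mon Apr 06 2009",NO_QUERY,user_a,"feeling sick today"\n'
    '4,2,"Mon Apr 06 2009",NO_QUERY,user_b,"what a lovely morning"\n'
)

# names assigns readable column names (hypothetical ones here);
# usecols keeps only the label and the tweet text.
tweets = pd.read_csv(
    sample_csv,
    header=None,
    names=["label", "id", "date", "query", "user", "tweet_text"],
    usecols=["label", "tweet_text"],
)
print(tweets)
```

Reading fewer columns also lowers peak memory, which may matter for a 1.6 million row file.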

In [65]:
#Preview
#Using both head and tail to inspect data from somewhere in the middle rather than just the top and bottom
sentiment_140_df_tweet_label_only.head(100000).tail()

Unnamed: 0,5,0
99995,my son has developed the new habit of waking up at 5:30am I'm on my second POT of coffee,0
99996,looks like my routers broke more tweets from my fone then,0
99997,i really dont want to be in college right now.. wish it was sunny !!,0
99998,@flossa *offers you pepto*,0
99999,@JosieHobo I WOULD SOOOOO BE THERE IF I DIDN'T HAVE REVISION TO DO,0
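
The `head(n).tail(m)` idiom used above selects the same rows as a positional slice. The sketch below shows the equivalence on a toy frame (the data is made up) and adds a seeded random sample as another way to look beyond the top and bottom rows.

```python
import pandas as pd

# Toy frame standing in for the 1.6M-row tweet table.
df = pd.DataFrame({"tweet_text": [f"tweet {i}" for i in range(20)],
                   "label": [0] * 20})

# head(10).tail(5) and iloc[5:10] select the same rows (positions 5..9).
via_head_tail = df.head(10).tail(5)
via_iloc = df.iloc[5:10]
assert via_head_tail.equals(via_iloc)

# A seeded random sample gives a less position-dependent look at the data.
random_rows = df.sample(n=5, random_state=0)
print(random_rows)
```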


In [66]:
sentiment_140_df_tweet_label_only.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 2 columns):
5    1600000 non-null object
0    1600000 non-null int64
dtypes: int64(1), object(1)
memory usage: 24.4+ MB


In [67]:
#Rename columns
sentiment_140_df_tweet_label_only.rename(columns = {0:'label', 5: 'tweet_text'}, inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  return super(DataFrame, self).rename(**kwargs)
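
The warning above appears because the two-column selection may still be tied to the original DataFrame. One common way to avoid it, sketched here on a toy frame with made-up values, is to take an explicit copy and to reassign the result of `rename` instead of using `inplace = True`.

```python
import pandas as pd

# Toy stand-in for the raw frame with integer column names.
raw = pd.DataFrame({0: [0, 4], 1: [11, 12], 5: ["bad day", "great day"]})

# .copy() makes the selection an independent DataFrame, so later
# renames and assignments neither touch nor warn about `raw`.
subset = raw[[5, 0]].copy()
subset = subset.rename(columns={0: "label", 5: "tweet_text"})
print(subset)
```

The rename in the cell above still takes effect despite the warning, as the next preview shows; the copy simply makes the intent explicit.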


In [68]:
sentiment_140_df_tweet_label_only.tail(1)

Unnamed: 0,tweet_text,label
1599999,happy #charitytuesday @theNSPCC @SparksCharity @SpeakingUpH4H,4


In [72]:
#Verify label data
sentiment_140_df_tweet_label_only['label'].value_counts()

4    800000
0    800000
Name: label, dtype: int64

We see a discrepancy between the actual data and the dataset description. The 'label' column in the actual data contains only 2 labels, 0 & 4, and is missing the neutral label, 2, mentioned in the dataset description.
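
A check like this can also be done programmatically by comparing the documented label set with the observed one. The sketch below uses a small made-up Series in place of the real label column.

```python
import pandas as pd

labels = pd.Series([0, 4, 4, 0, 4], name="label")  # toy data

expected = {0, 2, 4}                    # labels named in the dataset description
observed = set(labels.unique().tolist())
missing = expected - observed           # documented but absent from the data
unexpected = observed - expected        # present in the data but undocumented
print(f"missing: {missing}, unexpected: {unexpected}")
```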

In [73]:
#Inspect 0 label 
sentiment_140_df_tweet_label_only[sentiment_140_df_tweet_label_only['label'] 
                                  == 0].head(200000).tail(10)

Unnamed: 0,tweet_text,label
199990,@uhohcaitie aw man I am jealous. can't believe I missed the ~last Sydney show :| i will never forgive myself,0
199991,"@warley I like it too, but i can't use the search function w/ the remote that came with my XPS.",0
199992,@loganculwell sorry I missed the movie last night.,0
199993,has had a very nasty migraine all day and has missed a large quantity of sunshine,0
199994,with my main man. thankfully hes not crazy yet. p.s. i am getting sicker and sicker,0
199995,Doesn't feel good.,0
199996,work... again,0
199997,@damienfranco Its so common for it to crash now I find I have to delete the process then its ok again for a while its eating memory,0
199998,my baby boy is wearing big boy underwear,0
199999,Fml! I forgot my phone charger @home!,0


In [74]:
#Inspect 4 label
sentiment_140_df_tweet_label_only[sentiment_140_df_tweet_label_only['label'] 
                                  == 4].head(10000).tail(10)

Unnamed: 0,tweet_text,label
809990,@ddlovato here are 15:55 GOOD MORNING!,4
809991,"Fast &amp; Furious: New Model, Original Parts a really good film",4
809992,watching Arthur cos I'm way cool,4
809993,@BruceLaBruce lol i'm thinking of moving to Berlin,4
809994,@LaurenFisher obey the last fm algorithm. maybe it would turn out you really do like scooter if you gave him a chance,4
809995,Morning! I have slacked for two days in twittering! But here I am again. Just finished a good run Ready to start a new day.,4
809996,@bensummers Isn't that sweet of them.... Altruism at it's finest....,4
809997,"@jakrose Um, milk *fathers* don't have udders. And &quot;Milk, I am your Mother&quot; just doesn't have same ring. hahaha",4
809998,@zenaweist They could also tweet @BeccaRoberts,4
809999,"Good lord, I still have 125 work emails to catch up on and actually read. That'll teach _me_ to go have a vacation.",4


Based on manual inspection of the labels, it can be inferred that label 0 comprises mostly negative tweets, with some neutral exceptions such as index # 199998 - 'my baby boy.....'. <br>Similarly, tweets labelled 4 are mostly positive, with some exceptions such as index # 809998 - 'They could also....'.<br> Hence we will assume 0 as negative and 4 as positive for our model. In fact, label 4 being a mix of positive and neutral tweets is actually good for the model, as its scope is to distinguish grievances, i.e. complaints, concerns, protests and overly negative tweets, from other tweets, i.e. positive or neutral ones. Neutral tweets labelled as 0, however, will add some noise, which will be addressed by adding more data points from other datasets that comprise negative texts like complaints and bad reviews.<br>Since the scope of our model is primarily to classify grievances, we will change all 0 labels to **1, which means that the tweet/text can be classified as a grievance**, and 4 labels to **0, which means not a grievance**.


In [75]:
#Change labels 4 = 0 & 0 = 1
sentiment_140_df_tweet_label_only['label'] = np.where(sentiment_140_df_tweet_label_only['label']
                                                      == 4, 0, 1)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
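
The relabelling can also be written as an explicit dictionary lookup. This is a sketch on made-up rows, not the notebook's own method: unlike `np.where(label == 4, 0, 1)`, which turns every value other than 4 into 1, a mapping makes an unexpected label visible as a missing value.

```python
import pandas as pd

df = pd.DataFrame({"tweet_text": ["so tired", "love this", "ugh"],
                   "label": [0, 4, 0]})

# Sentiment140 label -> grievance label (1 = grievance, 0 = not grievance).
to_grievance = {0: 1, 4: 0}

mapped = df["label"].map(to_grievance)
# map() yields NaN for any value missing from the dict, so check first.
if mapped.isna().any():
    raise ValueError("found a label outside {0, 4}")

# assign() returns a new DataFrame instead of writing into a slice.
df = df.assign(label=mapped.astype(int))
print(df)
```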


In [76]:
#Verify 
display(sentiment_140_df_tweet_label_only.head())
display(sentiment_140_df_tweet_label_only['label'].value_counts())
#Changes have been made correctly

Unnamed: 0,tweet_text,label
0,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D",1
1,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!,1
2,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds,1
3,my whole body feels itchy and like its on fire,1
4,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.",1


1    800000
0    800000
Name: label, dtype: int64

In [77]:
#Write output to csv
sentiment_140_df_tweet_label_only.to_csv('sentiment140_processed.csv',
                                        index = False)

In [78]:
#Verify csv
df = pd.read_csv('sentiment140_processed.csv')
display(df.shape)
display(df.head())

(1600000, 2)

Unnamed: 0,tweet_text,label
0,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D",1
1,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!,1
2,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds,1
3,my whole body feels itchy and like its on fire,1
4,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.",1


In [79]:
#Read test data from sentiment 140 dataset
sentiment_140_df_test =pd.read_csv('trainingandtestdata/testdata.manual.2009.06.14.csv',
                                  header = None)

In [80]:
sentiment_140_df_test.head(2)

Unnamed: 0,0,1,2,3,4,5
0,4,3,Mon May 11 03:17:40 UTC 2009,kindle2,tpryan,"@stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right."
1,4,4,Mon May 11 03:18:03 UTC 2009,kindle2,vcu451,Reading my kindle2... Love it... Lee childs is good read.


In [81]:
#Extract tweet and label
sentiment_140_df_test_tweet_label_only = sentiment_140_df_test[[5,0]]


In [82]:
#Rename
sentiment_140_df_test_tweet_label_only.rename(columns = {0:'label', 5:'tweet_text'}, 
                                             inplace = True)
display(sentiment_140_df_test_tweet_label_only.shape)
display(sentiment_140_df_test_tweet_label_only.head())


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  return super(DataFrame, self).rename(**kwargs)


(498, 2)

Unnamed: 0,tweet_text,label
0,"@stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right.",4
1,Reading my kindle2... Love it... Lee childs is good read.,4
2,"Ok, first assesment of the #kindle2 ...it fucking rocks!!!",4
3,@kenburbary You'll love your Kindle2. I've had mine for a few months and never looked back. The new big one is huge! No need for remorse! :),4
4,@mikefish Fair enough. But i have the Kindle2 and I think it's perfect :),4


In [83]:
#Distinct labels
sentiment_140_df_test_tweet_label_only['label'].value_counts()

4    182
0    177
2    139
Name: label, dtype: int64

The test dataset for Sentiment 140 does have neutral tweets. Let's inspect each label to get an idea of how well this set has been labelled.

In [84]:
#Positive tweets
sentiment_140_df_test_tweet_label_only[
    sentiment_140_df_test_tweet_label_only['label'] == 4].head(100).tail(10)

Unnamed: 0,tweet_text,label
253,@the_real_usher LeBron is cool. I like his personality...he has good character.,4
254,Watching Lebron highlights. Damn that niggas good,4
255,@Lou911 Lebron is MURDERING shit.,4
256,@uscsports21 LeBron is a monsta and he is only 24. SMH The world ain't ready.,4
257,@cthagod when Lebron is done in the NBA he will probably be greater than Kobe. Like u said Kobe is good but there alot of 'good' players.,4
258,KOBE IS GOOD BT LEBRON HAS MY VOTE,4
260,"@asherroth World Cup 2010 Access?? Damn, that's a good look!",4
261,Just bought my tickets for the 2010 FIFA World Cup in South Africa. Its going to be a great summer. http://bit.ly/9GEZI,4
264,"The great Indian tamasha truly will unfold from May 16, the result day for Indian General Election.",4
265,"@crlane I have the Kindle2. I've seen pictures of the DX, but haven't seen it in person. I love my Kindle - I'm on it everyday.",4


In [85]:
#Neutral tweets
sentiment_140_df_test_tweet_label_only[
    sentiment_140_df_test_tweet_label_only['label'] == 2].head(100).tail(10)

Unnamed: 0,tweet_text,label
371,"UP! was sold out, so i'm seeing Night At The Museum 2. I'm __ years old.",2
375,Obama: Nationalization of GM to be short-term (AP) http://tinyurl.com/md347r,2
386,"Time Warner CEO hints at online fees for magazines (AP) - Read from Mountain View,United States. Views 16209 http://bit.ly/UdFCH",2
399,Lawson to head Newedge Hong Kong http://bit.ly/xLQSD #business #china,2
400,Weird Piano Guitar House in China! http://u2s.me/72i8,2
401,Send us your GM/Chevy photos http://tinyurl.com/luzkpq,2
407,@stevemoakler i had a dentist appt this morning and had the same conversation!,2
411,Check this video out -- David After Dentist http://bit.ly/47aW2,2
412,First dentist appointment [in years] on Wednesday possibly.,2
413,Tom Shanahan's latest column on SDSU and its NCAA Baseball Regional appearance: http://ow.ly/axhu,2


In [86]:
#Negative tweets
sentiment_140_df_test_tweet_label_only[
    sentiment_140_df_test_tweet_label_only['label'] == 0].head(120).tail(10)

Unnamed: 0,tweet_text,label
307,"#wolfram Alpha SUCKS! Even for researchers the information provided is less than you can get from #google or #wikipedia, totally useless!",0
312,@Fraggle312 oh those are awesome! i so wish they weren't owned by nike :(,0
314,"arhh, It's weka bug. = ="" and I spent almost two hours to find that out. crappy me",0
328,Oooooooh... North Korea is in troubleeeee! http://bit.ly/19epAH,0
329,Wat the heck is North Korea doing!!??!! They just conducted powerful nuclear tests! Follow the link: http://www.msnbc.msn.com/id/30921379,0
330,Listening to Obama... Friggin North Korea...,0
331,"I just realized we three monkeys in the white Obama.Biden,Pelosi . Sarah Palin 2012",0
332,@foxnews Pelosi should stay in China and never come back.,0
333,Nancy Pelosi gave the worst commencement speech I've ever heard. Yes I'm still bitter about this,0
334,ugh. the amount of times these stupid insects have bitten me. Grr..,0


Manual inspection of each label indicates that the dataset is fairly well labelled. Hence, we should use these observations for our model. Following the labelling standard for our model, all 0s, i.e. negatives, will be labelled as 1 (grievance), and 4 & 2 will be labelled as 0 (not grievance).

In [87]:
#Rename labels 0 = 1 , 4 & 2 = 0
sentiment_140_df_test_tweet_label_only['label'] = np.where(
sentiment_140_df_test_tweet_label_only['label'] == 0, 1, 0)

sentiment_140_df_test_tweet_label_only.tail(2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,tweet_text,label
496,"Trouble in Iran, I see. Hmm. Iran. Iran so far away. #flockofseagullsweregeopoliticallycorrect",1
497,Reading the tweets coming out of Iran... The whole thing is terrifying and incredibly sad...,1
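
One way to confirm that a three-to-two label mapping like this behaved as intended is a cross-tabulation of the original label against the new one. The sketch below uses made-up labels and a hypothetical `grievance` column so that the original label stays available for the audit.

```python
import pandas as pd

test = pd.DataFrame({"label": [0, 2, 4, 4, 0, 2, 2]})  # toy labels

# Keep the original label beside the new one to audit the mapping.
test["grievance"] = (test["label"] == 0).astype(int)

# Rows: original label, columns: new label, cells: row counts.
audit = pd.crosstab(test["label"], test["grievance"])
print(audit)
```

Every original label should have counts in exactly one column: 0 under grievance 1, and both 2 and 4 under grievance 0.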


In [88]:
#Write dataset to csv
sentiment_140_df_test_tweet_label_only.to_csv('sentiment140_neutlabels.csv', index = False)

In [89]:
#Verify csv
df = pd.read_csv('sentiment140_neutlabels.csv')
display(df.shape)
display(df.tail())

(498, 2)

Unnamed: 0,tweet_text,label
493,Ask Programming: LaTeX or InDesign?: submitted by calcio1 [link] [1 comment] http://tinyurl.com/myfmf7,0
494,"On that note, I hate Word. I hate Pages. I hate LaTeX. There, I said it. I hate LaTeX. All you TEXN3RDS can come kill me now.",1
495,Ahhh... back in a *real* text editing environment. I &lt;3 LaTeX.,0
496,"Trouble in Iran, I see. Hmm. Iran. Iran so far away. #flockofseagullsweregeopoliticallycorrect",1
497,Reading the tweets coming out of Iran... The whole thing is terrifying and incredibly sad...,1


# Consumer complaints

The Consumer complaints dataset comprises consumer complaints against financial products and services. More details on the dataset can be found here: https://www.kaggle.com/cfpb/us-consumer-finance-complaints. This dataset contains only complaints and hence will only contribute towards classification of grievances, i.e. label 1; it does not provide any information on positive or neutral narratives. However, it is still very important as it provides ideal instances of our target class.

In [47]:
consumer_complaints =  pd.read_csv('Consumer_Complaints.csv')

In [48]:
consumer_complaints.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,3/12/2014,Mortgage,Other mortgage,"Loan modification,collection,foreclosure",,,,M&T BANK CORPORATION,MI,48382,,,Referral,3/17/2014,Closed with explanation,Yes,No,759217
1,10/1/2016,Credit reporting,,Incorrect information on credit report,Account status,I have outdated information on my credit report that I have previously disputed that has yet to be removed this information is more then seven years old and does not meet credit reporting requirements,Company has responded to the consumer and the CFPB and chooses not to provide a public response,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",AL,352XX,,Consent provided,Web,10/5/2016,Closed with explanation,Yes,No,2141773
2,10/17/2016,Consumer Loan,Vehicle loan,Managing the loan or lease,,"I purchased a new car on XXXX XXXX. The car dealer called Citizens Bank to get a 10 day payoff on my loan, good till XXXX XXXX. The dealer sent the check the next day. When I balanced my checkbook on XXXX XXXX. I noticed that Citizens bank had taken the automatic payment out of my checking account at XXXX XXXX XXXX Bank. I called Citizens and they stated that they did not close the loan until XXXX XXXX. ( stating that they did not receive the check until XXXX. XXXX. ). I told them that I did not believe that the check took that long to arrive. XXXX told me a check was issued to me for the amount overpaid, they deducted additional interest. Today ( XXXX XXXX, ) I called Citizens Bank again and talked to a supervisor named XXXX, because on XXXX XXXX. I received a letter that the loan had been paid in full ( dated XXXX, XXXX ) but no refund check was included. XXXX stated that they hold any over payment for 10 business days after the loan was satisfied and that my check would be mailed out on Wed. the XX/XX/XXXX.. I questioned her about the delay in posting the dealer payment and she first stated that sometimes it takes 3 or 4 business days to post, then she said they did not receive the check till XXXX XXXX I again told her that I did not believe this and asked where is my money. She then stated that they hold the over payment for 10 business days. I asked her why, and she simply said that is their policy. I asked her if I would receive interest on my money and she stated no. I believe that Citizens bank is deliberately delaying the posting of payment and the return of consumer 's money to make additional interest for the bank. If this is not illegal it should be, it does hurt the consumer and is not ethical. My amount of money lost is minimal but if they are doing this on thousands of car loans a month, then the additional interest earned for them could be staggering. 
I still have another car loan from Citizens Bank and I am afraid when I trade that car in another year I will run into the same problem again.",,"CITIZENS FINANCIAL GROUP, INC.",PA,177XX,Older American,Consent provided,Web,10/20/2016,Closed with explanation,Yes,No,2163100
3,6/8/2014,Credit card,,Bankruptcy,,,,AMERICAN EXPRESS COMPANY,ID,83854,Older American,,Web,6/10/2014,Closed with explanation,Yes,Yes,885638
4,9/13/2014,Debt collection,Credit card,Communication tactics,Frequent or repeated calls,,,"CITIBANK, N.A.",VA,23233,,,Web,9/13/2014,Closed with explanation,Yes,Yes,1027760


In [49]:
consumer_complaints.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 903983 entries, 0 to 903982
Data columns (total 18 columns):
Date received                   903983 non-null object
Product                         903983 non-null object
Sub-product                     668823 non-null object
Issue                           903983 non-null object
Sub-issue                       426386 non-null object
Consumer complaint narrative    199970 non-null object
Company public response         257981 non-null object
Company                         903983 non-null object
State                           894758 non-null object
ZIP code                        894705 non-null object
Tags                            126038 non-null object
Consumer consent provided?      375434 non-null object
Submitted via                   903983 non-null object
Date sent to company            903983 non-null object
Company response to consumer    903983 non-null object
Timely response?                903983 non-null object
Consumer 

`Consumer complaint narrative` is the only column of interest and contains 199970 non-null values. All rows where it is null will be dropped, as will every other column, and all remaining rows will be labelled as 1 (grievance) by creating a label column.

In [50]:
#Extract consumer complaint narrative
consumer_complaint_narrative_df = pd.DataFrame(consumer_complaints['Consumer complaint narrative'])
consumer_complaint_narrative_df.head()

Unnamed: 0,Consumer complaint narrative
0,
1,I have outdated information on my credit report that I have previously disputed that has yet to be removed this information is more then seven years old and does not meet credit reporting requirements
2,"I purchased a new car on XXXX XXXX. The car dealer called Citizens Bank to get a 10 day payoff on my loan, good till XXXX XXXX. The dealer sent the check the next day. When I balanced my checkbook on XXXX XXXX. I noticed that Citizens bank had taken the automatic payment out of my checking account at XXXX XXXX XXXX Bank. I called Citizens and they stated that they did not close the loan until XXXX XXXX. ( stating that they did not receive the check until XXXX. XXXX. ). I told them that I did not believe that the check took that long to arrive. XXXX told me a check was issued to me for the amount overpaid, they deducted additional interest. Today ( XXXX XXXX, ) I called Citizens Bank again and talked to a supervisor named XXXX, because on XXXX XXXX. I received a letter that the loan had been paid in full ( dated XXXX, XXXX ) but no refund check was included. XXXX stated that they hold any over payment for 10 business days after the loan was satisfied and that my check would be mailed out on Wed. the XX/XX/XXXX.. I questioned her about the delay in posting the dealer payment and she first stated that sometimes it takes 3 or 4 business days to post, then she said they did not receive the check till XXXX XXXX I again told her that I did not believe this and asked where is my money. She then stated that they hold the over payment for 10 business days. I asked her why, and she simply said that is their policy. I asked her if I would receive interest on my money and she stated no. I believe that Citizens bank is deliberately delaying the posting of payment and the return of consumer 's money to make additional interest for the bank. If this is not illegal it should be, it does hurt the consumer and is not ethical. My amount of money lost is minimal but if they are doing this on thousands of car loans a month, then the additional interest earned for them could be staggering. 
I still have another car loan from Citizens Bank and I am afraid when I trade that car in another year I will run into the same problem again."
3,
4,


In [51]:
#Drop NaN
consumer_complaint_narrative_df.dropna(inplace = True)
display(consumer_complaint_narrative_df.head(2))
print(f"Shape of consumer narrative after droppping nulls is: {consumer_complaint_narrative_df.shape}")

Unnamed: 0,Consumer complaint narrative
1,I have outdated information on my credit report that I have previously disputed that has yet to be removed this information is more then seven years old and does not meet credit reporting requirements
2,"I purchased a new car on XXXX XXXX. The car dealer called Citizens Bank to get a 10 day payoff on my loan, good till XXXX XXXX. The dealer sent the check the next day. When I balanced my checkbook on XXXX XXXX. I noticed that Citizens bank had taken the automatic payment out of my checking account at XXXX XXXX XXXX Bank. I called Citizens and they stated that they did not close the loan until XXXX XXXX. ( stating that they did not receive the check until XXXX. XXXX. ). I told them that I did not believe that the check took that long to arrive. XXXX told me a check was issued to me for the amount overpaid, they deducted additional interest. Today ( XXXX XXXX, ) I called Citizens Bank again and talked to a supervisor named XXXX, because on XXXX XXXX. I received a letter that the loan had been paid in full ( dated XXXX, XXXX ) but no refund check was included. XXXX stated that they hold any over payment for 10 business days after the loan was satisfied and that my check would be mailed out on Wed. the XX/XX/XXXX.. I questioned her about the delay in posting the dealer payment and she first stated that sometimes it takes 3 or 4 business days to post, then she said they did not receive the check till XXXX XXXX I again told her that I did not believe this and asked where is my money. She then stated that they hold the over payment for 10 business days. I asked her why, and she simply said that is their policy. I asked her if I would receive interest on my money and she stated no. I believe that Citizens bank is deliberately delaying the posting of payment and the return of consumer 's money to make additional interest for the bank. If this is not illegal it should be, it does hurt the consumer and is not ethical. My amount of money lost is minimal but if they are doing this on thousands of car loans a month, then the additional interest earned for them could be staggering. 
I still have another car loan from Citizens Bank and I am afraid when I trade that car in another year I will run into the same problem again."


Shape of consumer narrative after droppping nulls is: (199970, 1)


In [52]:
#Add label
consumer_complaint_narrative_df['label'] = 1
display(consumer_complaint_narrative_df.tail(2))
display(consumer_complaint_narrative_df['label'].value_counts())


Unnamed: 0,Consumer complaint narrative,label
903980,"I was contacted on XX/XX/XXXX email by XXXX from Caliber Home Loans to refinance my current loan with them. I replied on XX/XX/XXXX he gave me an acceptable estimate for refinancing my condo which was already mortgaged with Caliber. XXXX then proceeded to try to get me to refinance my primary home. He presented another favorable quote so I proceeded. We started the process and got both appraisals done and I provided all of the requested documents. The appraisals came back lower than the estimates so XXXX reworked the estimates and I approved and we moved on. This all happened before XXXX. I got a call from my first loan processor XXXX on XX/XX/XXXX herself and stating she was beginning the next phase of my loans. I have never heard another word from XXXX. I began trying to get a status of my loans on XX/XX/XXXX to allow extra time for the holidays. I called and emailed both XX/XX/XXXX and XX/XX/XXXX. Neither of them EVER replied or returned my phone calls. I tried emailing and calling multiple times every day. I called their main number but they were useless. I tried logging onto my account to check the status of my loans but I could n't even access my account because of some IT problem they were having. I finally got someone on the XXXX number to give me someone else 's number. So I started calling XXXX who was supposedly XXXX boss but he never called me back either. On XX/XX/XXXX I got a call back from XXXX and he has been the most responsive but he has gone silent since the holidays. I was then assigned a new loan processor XX/XX/XXXX who basically started all over. I was told my new loan details after the appraisals came back lower were never entered into their system even though I had signed the new loans before XXXX. XXXX sent new estimates and my closing costs were higher even though I was borrowing less. XXXX said it was because my properties were entered as something other than condos and the condo closing costs are higher. 
He also refused to reimburse me for the appraisals. He said I would get "" credit for paying '' them at closing but I would not get reimbursed as XXXX had told me. He also is not honoring the lender credit I was given. I had locked in my interest rates which expired on XX/XX/2017. Supposedly Caliber extended the rate lock until XX/XX/XXXX but I have only received an email about the rate lock. I have not seen any official documentation. Late last week XXXX said there is a problem with the HOA on my primary property because of pending litigation but that it could still be handled and I could close on both properties if the right departments from Caliber was involved. I have not heard a word from XXXX since and the only communication I received at all was from XXXX 's boss yesterday in an informal email about my rate lock extension. \nI know this is a lengthy story but here are my complaints : 1. Deceitful marketing practices. I am confident XXXX intentionally entered my properties into their system as something other than condos to make the deals more attractive in order to get me to refinance. The reason I am confident is because they already hold the loan on one of my properties. \n2. Not honoring the agreement of reimbursing me for the appraisals and not giving me the lender credit as promised. \n3. The total and complete lack of professionalism in their lack of communication with their customer, me. This is the absolute worst experience I have ever had with any institution wanting my business for anything.",1
903982,"I had a debit that was included in my chapter XXXX BK, almost two years letter this item showed on my credit reports under collection status for Midwest Recovery Systems. This dropped my credit score XXXX points. I called them and they said their client had n't informed them that it was included in BK, but the damage had already been done. It took them 30 days to remove this incorrectly put item on my credit reports. Its still showing up on my XXXX. I thought this was against the law for them to do that since I am protected by BK laws. They should be fined for this. I should sue.",1


1    199970
Name: label, dtype: int64
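
The extract, drop and label steps above can also be expressed as a single method chain. This is a sketch on a made-up three-row frame; the `reset_index` call is an addition not present in the cells above, included only to renumber the surviving rows.

```python
import pandas as pd

# Toy stand-in for the complaints table; most narratives are missing.
complaints = pd.DataFrame({
    "Product": ["Mortgage", "Credit reporting", "Credit card"],
    "Consumer complaint narrative": [None, "Outdated item on my report", None],
})

narratives = (
    complaints[["Consumer complaint narrative"]]  # keep one column
    .dropna()                                     # drop rows without text
    .assign(label=1)                              # every complaint is a grievance
    .reset_index(drop=True)                       # renumber rows from 0
)
print(narratives)
```

Because the chain builds a new DataFrame at each step, it avoids in-place modification of a selection.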

In [54]:
#Write to csv
consumer_complaint_narrative_df.to_csv('consumer_complaints_processed.csv', index = False)

In [56]:
#Verify csv
df = pd.read_csv('consumer_complaints_processed.csv')
display(df.shape)
display(df.tail())

(199970, 2)

Unnamed: 0,Consumer complaint narrative,label
199965,"Our son was taken to XXXX XXXX XXXX XXXX XXXX XXXX on XXXX XXXX, 2012 as an ER visit. We had insurance through XXXX XXXX at the time and the hospital failed to submit the claim to our insurance company, and has been asking us to pay out of pocket for the services even though the services were covered under our insurance. XXXX says they submitted a claim for {$1200.00} to the insurance company, and when I asked XXXX to explain to me why it was denied, they could n't provide a reason other than simply saying that it "" was n't paid '' ; I believe it "" was n't paid '' because it was never sent out not to mention that XXXX has no record of this claim amount for XXXX 2012 for our son. \r\n\r\nMy husband I followed up numerous times with the hospital asking them to resubmit the appropriate paperwork to the insurance and they never sent the claim to the insurance, even though they reassured us they had. We tried everything possible to get the XXXX parties connected and appropriate documents sent over, but we are the middle men. I spoke with XXXX at XXXX today, and she confirmed that the only claim they received for our son for services in XXXX 2012 was a claim totaling {$670.00}, of which we were responsible for {$250.00} ( {$220.00} deductible amt + {$30.00} co-pay ). XXXX does not show any claim for {$1200.00} ; therefore it was never sent/received.",1
199966,"On XXXX/XXXX/13, without my authorization, Bank of America withdrew {$29000.00} from my personal account to charge off my company credit card ( XXXX - Bank of America ). My corporation had a separate bank account and a Business card account with Bank of America not linked with my personal account.",1
199967,"I had an account with XXXX in XX/XX/XXXX this was previously disputed for XXXX $ $ because at & t sold their towers in the area of my employer so my father and I whom both work here could not receive any phone calls while in our job, XXXX agreed and deleted it from our credit report. Now they 're saying I owe a combined about of XXXX sounds like XXXX is trying to combine the XXXX XXXX my father and I owed for termination onto my report only.",1
199968,"I was contacted on XX/XX/XXXX email by XXXX from Caliber Home Loans to refinance my current loan with them. I replied on XX/XX/XXXX he gave me an acceptable estimate for refinancing my condo which was already mortgaged with Caliber. XXXX then proceeded to try to get me to refinance my primary home. He presented another favorable quote so I proceeded. We started the process and got both appraisals done and I provided all of the requested documents. The appraisals came back lower than the estimates so XXXX reworked the estimates and I approved and we moved on. This all happened before XXXX. I got a call from my first loan processor XXXX on XX/XX/XXXX herself and stating she was beginning the next phase of my loans. I have never heard another word from XXXX. I began trying to get a status of my loans on XX/XX/XXXX to allow extra time for the holidays. I called and emailed both XX/XX/XXXX and XX/XX/XXXX. Neither of them EVER replied or returned my phone calls. I tried emailing and calling multiple times every day. I called their main number but they were useless. I tried logging onto my account to check the status of my loans but I could n't even access my account because of some IT problem they were having. I finally got someone on the XXXX number to give me someone else 's number. So I started calling XXXX who was supposedly XXXX boss but he never called me back either. On XX/XX/XXXX I got a call back from XXXX and he has been the most responsive but he has gone silent since the holidays. I was then assigned a new loan processor XX/XX/XXXX who basically started all over. I was told my new loan details after the appraisals came back lower were never entered into their system even though I had signed the new loans before XXXX. XXXX sent new estimates and my closing costs were higher even though I was borrowing less. XXXX said it was because my properties were entered as something other than condos and the condo closing costs are higher. 
He also refused to reimburse me for the appraisals. He said I would get "" credit for paying '' them at closing but I would not get reimbursed as XXXX had told me. He also is not honoring the lender credit I was given. I had locked in my interest rates which expired on XX/XX/2017. Supposedly Caliber extended the rate lock until XX/XX/XXXX but I have only received an email about the rate lock. I have not seen any official documentation. Late last week XXXX said there is a problem with the HOA on my primary property because of pending litigation but that it could still be handled and I could close on both properties if the right departments from Caliber was involved. I have not heard a word from XXXX since and the only communication I received at all was from XXXX 's boss yesterday in an informal email about my rate lock extension. \r\nI know this is a lengthy story but here are my complaints : 1. Deceitful marketing practices. I am confident XXXX intentionally entered my properties into their system as something other than condos to make the deals more attractive in order to get me to refinance. The reason I am confident is because they already hold the loan on one of my properties. \r\n2. Not honoring the agreement of reimbursing me for the appraisals and not giving me the lender credit as promised. \r\n3. The total and complete lack of professionalism in their lack of communication with their customer, me. This is the absolute worst experience I have ever had with any institution wanting my business for anything.",1
199969,"I had a debit that was included in my chapter XXXX BK, almost two years letter this item showed on my credit reports under collection status for Midwest Recovery Systems. This dropped my credit score XXXX points. I called them and they said their client had n't informed them that it was included in BK, but the damage had already been done. It took them 30 days to remove this incorrectly put item on my credit reports. Its still showing up on my XXXX. I thought this was against the law for them to do that since I am protected by BK laws. They should be fined for this. I should sue.",1


# Amazon Reviews

The Amazon reviews dataset (http://sifaka.cs.uiuc.edu/~wang296/Data/index.html) was created by scraping Amazon reviews across the six product categories listed below.
1. cameras
2. laptops
3. mobilephone
4. tablets
5. TVs
6. video_surveillance

Each category contains multiple JSON files, each holding a collection of reviews.
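To get a feel for this layout before loading anything, the number of JSON files per category can be counted first. The sketch below is only an illustration: the helper name `count_json_files` is hypothetical, and it runs against a small temporary folder tree with made-up file names rather than the real `AmazonReviews` folder.

```python
import json
import tempfile
from pathlib import Path

def count_json_files(root):
    """Map each category sub-folder under root to its number of .json files."""
    root = Path(root)
    return {
        sub.name: sum(1 for _ in sub.glob('*.json'))
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }

# Build a tiny stand-in for the 'AmazonReviews' folder (file names are made up)
with tempfile.TemporaryDirectory() as tmp:
    for category, n_files in [('cameras', 2), ('laptops', 1)]:
        folder = Path(tmp) / 'AmazonReviews' / category
        folder.mkdir(parents=True)
        for i in range(n_files):
            (folder / f'product_{i}.json').write_text(json.dumps({'Reviews': []}))
    counts = count_json_files(Path(tmp) / 'AmazonReviews')

print(counts)
```

On the real data the call would simply be `count_json_files('AmazonReviews')`, assuming the working directory is the one set at the top of the notebook.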

In [20]:
#Inspect a single json file
with open('AmazonReviews/cameras/143546026X.json') as f: #Context manager closes the file handle
    review = json.load(f)
display(type(review))
display(review)
#The file parses to a nested dictionary

dict

{'Reviews': [{'Title': "This book's a lifesaver",
   'Author': 'C. Juszak',
   'ReviewID': 'R2MCX553KZI0WY',
   'Overall': '5.0',
   'Content': 'While the T3 is a pretty easy camera to use, there are a lot of hidden features and tricks I would probably never have discovered without this book. Chapter 3 starts out with..."One thing that surprises new owners of the Canon EOS Rebel T3 is that the camera has a total of 496 buttons, dials, switches, levers, latches, and knobs bristling from it\'s surface. Okay, I lied. Actually, the real number is closer to two dozen controls and adjustments, but that\'s still a lot of components to master, especially when you consider that many of these controls serve double-duty to give you access to multiple functions."This quote serves two purposes here; to illustrate that this is not just a dry reference manual and to point out that there are in fact quite a few bells and whistles that the camera has to offer. I\'ve actually found this book to be an en

In [21]:
display(type(review['Reviews']))
display(review['Reviews'])
#Value of key 'Reviews' is a list of review dictionaries 

list

[{'Title': "This book's a lifesaver",
  'Author': 'C. Juszak',
  'ReviewID': 'R2MCX553KZI0WY',
  'Overall': '5.0',
  'Content': 'While the T3 is a pretty easy camera to use, there are a lot of hidden features and tricks I would probably never have discovered without this book. Chapter 3 starts out with..."One thing that surprises new owners of the Canon EOS Rebel T3 is that the camera has a total of 496 buttons, dials, switches, levers, latches, and knobs bristling from it\'s surface. Okay, I lied. Actually, the real number is closer to two dozen controls and adjustments, but that\'s still a lot of components to master, especially when you consider that many of these controls serve double-duty to give you access to multiple functions."This quote serves two purposes here; to illustrate that this is not just a dry reference manual and to point out that there are in fact quite a few bells and whistles that the camera has to offer. I\'ve actually found this book to be an enjoyable read; ho

Relevant keys within each review dictionary:
1. 'Overall' - rating on a scale of 5, used to label the review as a grievance (1) or not (0)
2. 'Content' - review narrative
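The labelling idea behind the 'Overall' key can be sketched as follows. The notebook does not state the rating cut-off at this point, so the threshold `GRIEVANCE_MAX_RATING = 2.0` and the helper name `label_from_rating` are assumptions made purely for illustration. Ratings are stored as strings in the JSON (for example '5.0'), so the sketch converts them to numbers first and drops any rating that cannot be parsed.

```python
import pandas as pd

# Hypothetical cut-off: ratings at or below this value count as a grievance.
# This threshold is an assumption for illustration, not taken from the notebook.
GRIEVANCE_MAX_RATING = 2.0

def label_from_rating(df, rating_col='Overall'):
    """Return a copy of df with numeric ratings and a binary 'label' column."""
    # Ratings arrive as strings such as '5.0'; unparseable values become NaN
    ratings = pd.to_numeric(df[rating_col], errors='coerce')
    valid = ratings.notna()
    # Keep only rows with a usable rating
    out = df.loc[valid].copy()
    out[rating_col] = ratings[valid]
    # 1 = grievance, 0 = everything else
    out['label'] = (out[rating_col] <= GRIEVANCE_MAX_RATING).astype(int)
    return out

sample = pd.DataFrame({
    'Content': ['Stopped working after a week', 'Great camera', 'No rating given'],
    'Overall': ['1.0', '5.0', 'None'],
})
labelled = label_from_rating(sample)
print(labelled[['Overall', 'label']])
```

Keeping the threshold in a single named constant makes it easy to change once the actual cut-off is decided.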

In [6]:
# Create a list of folder paths: the root folder first, then one per category
cat_folder_ls = [dirpath for dirpath, _, _ in os.walk('AmazonReviews')]
print(cat_folder_ls)

['AmazonReviews', 'AmazonReviews\\cameras', 'AmazonReviews\\laptops', 'AmazonReviews\\mobilephone', 'AmazonReviews\\tablets', 'AmazonReviews\\TVs', 'AmazonReviews\\video_surveillance']


In [11]:
#Extract review text and rating from each category and concatenate into a master df
frames = []
#Loop over each category
for category in cat_folder_ls[1:]: #Skip the first entry, i.e. the root folder 'AmazonReviews'
    print(category)
    #Loop over each json file under a category
    for filename in os.listdir(category):
        #Read json; the context manager closes each file handle
        file_path = os.path.join(category, filename)
        with open(file_path) as f:
            product_info = json.load(f)

        #Skip files with an empty reviews list
        if product_info['Reviews']:
            #A list of review dictionaries converts directly to a df
            temp_df = pd.DataFrame(product_info['Reviews'])
            #Keep only content and rating
            frames.append(temp_df[['Content', 'Overall']])
#Concatenate once at the end; pd.concat inside the loop would re-copy master_df on every file
master_df = pd.concat(frames, axis = 0, ignore_index = True)
   


AmazonReviews\cameras
AmazonReviews\laptops
AmazonReviews\mobilephone
AmazonReviews\tablets
AmazonReviews\TVs
AmazonReviews\video_surveillance


In [13]:
#Write output to a csv
master_df.to_csv('amazonreviews_extract.csv', index = False)

In [90]:
#Verify csv
df = pd.read_csv('amazonreviews_extract.csv')
display(df.shape)
display(df.head())




(1116526, 2)

Unnamed: 0,Content,Overall
0,"English teachers who work with native speakers of other languages will be thrilled with these flash cards. They are handsomely printed on durable stock with a protective coating for years of shuffling. I've been making my own for over twenty years, but the result has never approached the artful and clear design of these sturdy and compact cards. Students of linguistics should also find these cards useful for developing fluency in IPA transcription.The transcription of the example words for English is a nice compromise between overly broad and unnecessarily narrow. Diacritics, for example, are largely limited to markers of length and syllabicity. A General American accent is the model; primary and secondary stress are marked, as are syllable boundaries. For most ESOL/EFL teachers (at least in the US), the 43 cards covering all the sounds of American English will get the most use, and they are conveniently color-coded (Deck 2) to distinguish them from the rest. Linguistics students will likely spend more time with the other two decks, one with the symbol on one side and its phonetic description on the reverse, and the other giving the names for the individual symbols. The decision to separate into decks and package all three decks together was both generous and wise: for a slightly higher price, you have the capability to accurately describe the languages of any of your students, even if they hail from South Sudan!One might complain that a tool for teaching the sounds of a language should actually make those sounds audibly. (Perhaps the folks at Minokidowinan already have an app for the iPhone in the works!) But really, the task to which these flashcards are best suited is improving the fluency of phonetic transcription, which is, after all, necessarily a written exercise.",5
1,"While I have a pretty limited knowledge of phonetics/linguistics as a formal discipline, I entered a drawing to win a set of these cards, because although I have rudimentary knowledge about the IPA, I have always been very interested in it. These cards make learning and understanding the international phonetic alphabet accessible, and not intimidating. The three different decks break down each symbol into three components- the phonetic description, transcriptions, and the symbol name. This breakdown allows you to focus on studying one component without being inundated with too much information, which is helpful if you've just been introduced to the IPA/ phonetics study, and allows you advance your study at your own rate. The cards are clear and concise, and laid out nicely! I've really been enjoying using and learning these cards!",5
2,"I'm starting my Masters in Teaching English as a Second Language so I wanted something to help me with studying the phonetic alphabet. These cards are ok, but I wish for the price they would have a little more information. They give you the mere basics you would expect from flash cards and do not give much more explanation. That would be fine if I wasn't paying almost 60$ for them. There are many complex terms on the cards that they do not explain at all. SO not a bad set, but I would have liked them to be a bit more in depth than they are. Other than that they are made well and have a good organization system.",4
3,"As one with a Bachelor's in Linguistics and much exposure to the world's languages, I have to say these flash cards are excellent. They are broken down into 3 different flash card types, each one helpful in its own way to mastering the phonetics of the world's languages. My only disclaimer would be that these cards do require some base-level prerequisite linguistic knowledge. For example, one set of the cards provides the IPA symbol on one side and the description of the sound on the other. If you are not familiar with basic phrases used in phonetics like ""central, back, closed, rounded,"" etc... you may have to do some research before using the cards, because there is not an in-depth explanation on the cards. It is this simplicity and organization of the cards that I actually greatly appreciate. Lastly, these cards may seem a little pricey... I will say that I think it is worth it. These cards are made to last and they are a great tool for learning, teaching, using with non-native speakers of a language, using for students learning other languages, etc. I would recommend these cards to anyone interested in linguistics, learning a new language, or teaching.",5
4,I purchased these flash cards to get a head start on my Speech and hearing class in International Phonetic Alphabet study. These flash cards are perfect and I will get a lot of use out of them.,5
