##**Install gensim to get access to the Word2Vec model**

In [90]:
!pip install gensim

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


###**Reading and Exploring the Dataset**

***Dataset Context:***

The Consumer Financial Protection Bureau (CFPB) is a U.S. federal agency that acts as a mediator when disputes arise between financial institutions and consumers. Consumers can submit a narrative of their dispute to the agency through a web form. An NLP model that classifies these complaints and routes them to the appropriate teams would be more efficient than tagging each complaint manually.

Link to the Dataset: https://www.kaggle.com/datasets/shashwatwork/consume-complaints-dataset-fo-nlp?resource=download

In [91]:
!unzip archive.zip

Archive:  archive.zip
replace complaints_processed.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: complaints_processed.csv  


##**Importing the necessary packages**

In [92]:
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt
import gensim


###Creating pandas dataframe from the extracted csv file

In [93]:
df = pd.read_csv("complaints_processed.csv")

In [94]:
df.head(5)

| | Unnamed: 0 | product | narrative |
|---|---|---|---|
| 0 | 0 | credit_card | purchase order day shipping amount receive pro... |
| 1 | 1 | credit_card | forwarded message date tue subject please inve... |
| 2 | 2 | retail_banking | forwarded message cc sent friday pdt subject f... |
| 3 | 3 | credit_reporting | payment history missing credit report speciali... |
| 4 | 4 | credit_reporting | payment history missing credit report made mis... |


In [95]:
df["narrative"][0]

'purchase order day shipping amount receive product week sent followup email exact verbiage paid two day shipping received order company responded im sorry inform due unusually high order volume order shipped several week stock since early due high demand although continuing take order guaranteeing receive order place due time mask order exact shipping date right however guarantee ship soon soon delivers product u getting small shipment shipping first come first served basis appreciate patience fulfill order quickly recommend keeping order lose place line cancel distributor stock moment prefer cancel please note ask via email cancel accordance cancellation policy agreed checkout electronic inventory online requested order canceled refund issued canceled order sent verification order canceled refunded item particulate respirator refunded subtotal shipping tax total usd visa ending refund called disputed amount stated nothing needed submitted address issue recharged item removing called 

In [96]:
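# Note: without parentheses this only displays the bound method (its repr includes the frame);
# call df.describe() to get summary statistics instead.
df.describe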

<bound method NDFrame.describe of         Unnamed: 0           product  \
0                0       credit_card   
1                1       credit_card   
2                2    retail_banking   
3                3  credit_reporting   
4                4  credit_reporting   
...            ...               ...   
162416      162416   debt_collection   
162417      162417       credit_card   
162418      162418   debt_collection   
162419      162419       credit_card   
162420      162420  credit_reporting   

                                                narrative  
0       purchase order day shipping amount receive pro...  
1       forwarded message date tue subject please inve...  
2       forwarded message cc sent friday pdt subject f...  
3       payment history missing credit report speciali...  
4       payment history missing credit report made mis...  
...                                                   ...  
162416                                               name  
16241

In [97]:
# Drop the leftover index column ("Unnamed: 0")
df.drop(columns=df.columns[0], inplace=True)

In [98]:
df.columns

Index(['product', 'narrative'], dtype='object')

In [99]:
df.head()

| | product | narrative |
|---|---|---|
| 0 | credit_card | purchase order day shipping amount receive pro... |
| 1 | credit_card | forwarded message date tue subject please inve... |
| 2 | retail_banking | forwarded message cc sent friday pdt subject f... |
| 3 | credit_reporting | payment history missing credit report speciali... |
| 4 | credit_reporting | payment history missing credit report made mis... |


In [100]:
df.shape

(162421, 2)

In [101]:
type(df['narrative'][0])

str

In [102]:
df['narrative'][0][:150]

'purchase order day shipping amount receive product week sent followup email exact verbiage paid two day shipping received order company responded im s'

###**Now that the data is loaded into a DataFrame, we need to preprocess it before training the word vectors**

In [103]:
# We will use gensim's built-in functions to preprocess the text

In [104]:
# Example: tokenize the first 150 characters of the first narrative
gensim.utils.simple_preprocess('purchase order day shipping amount receive product week sent followup email exact verbiage paid two day shipping received order company responded im s')

['purchase',
 'order',
 'day',
 'shipping',
 'amount',
 'receive',
 'product',
 'week',
 'sent',
 'followup',
 'email',
 'exact',
 'verbiage',
 'paid',
 'two',
 'day',
 'shipping',
 'received',
 'order',
 'company',
 'responded',
 'im']
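
To make the behavior of `simple_preprocess` more concrete: it lowercases the text, splits it into alphabetic tokens, and by default discards tokens shorter than 2 or longer than 15 characters. The sketch below is a rough pure-Python approximation of that behavior, not gensim's actual implementation. The helper name `approx_simple_preprocess` and the sample sentence are hypothetical, and the regular expression is a simplifying assumption.

```python
import re

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (hypothetical helper)."""
    # Lowercase, then keep runs of letters; digits and punctuation act as separators.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Drop very short and very long tokens, mirroring the min_len/max_len defaults.
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]

# Made-up sentence with capitals, a digit, and punctuation.
print(approx_simple_preprocess("The CFPB received 2 e-mails, and I'm upset!"))
# prints ['the', 'cfpb', 'received', 'mails', 'and', 'upset']
```

The narratives in this dataset are already lowercased and stripped of punctuation, so here the call mainly splits each string into a list of tokens and drops one-letter words.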

In [105]:
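# Check whether every narrative is a string; stop at the first value that is not
message = 'There is only string type'
for text in df['narrative']:
  if not isinstance(text, str):
    print(text, type(text))
    message = "There are types other than strings"
    break
print(message)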


nan <class 'float'>
There are types other than strings


In [110]:
df['narrative'].isnull().sum()

10

In [112]:
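# Keep only the narrative column (as a Series) and drop the rows with missing text.
# Note: this discards the product labels.
df = df['narrative'].dropna(axis=0)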

In [116]:
type(df)

pandas.core.series.Series

In [117]:
df.head()

0    purchase order day shipping amount receive pro...
1    forwarded message date tue subject please inve...
2    forwarded message cc sent friday pdt subject f...
3    payment history missing credit report speciali...
4    payment history missing credit report made mis...
Name: narrative, dtype: object
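
Selecting the `narrative` column before calling `dropna` leaves a Series, so the `product` labels are no longer available. If the labels are needed later (for example, for the classification task described in the dataset context), one possible alternative is to drop rows by subset and keep the frame intact. The sketch below uses a small made-up frame, not the real CSV, and the variable names `toy` and `clean` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the complaints data (made-up rows, not from the real CSV).
toy = pd.DataFrame({
    "product": ["credit_card", "retail_banking", "debt_collection"],
    "narrative": ["late fee charged twice", None, "collector called daily"],
})

# Drop rows whose narrative is missing, but keep every column.
clean = toy.dropna(subset=["narrative"])

print(clean.shape)          # rows that have a narrative, both columns kept
print(list(clean.columns))
```

With this variant, the tokenization step would be applied to `clean["narrative"]` rather than to the whole object.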

In [118]:
narrative_text = df.apply(gensim.utils.simple_preprocess)

In [119]:
narrative_text

0         [purchase, order, day, shipping, amount, recei...
1         [forwarded, message, date, tue, subject, pleas...
2         [forwarded, message, cc, sent, friday, pdt, su...
3         [payment, history, missing, credit, report, sp...
4         [payment, history, missing, credit, report, ma...
                                ...                        
162416                                               [name]
162417                                               [name]
162418                                               [name]
162419                                               [name]
162420                                               [name]
Name: narrative, Length: 162411, dtype: object

In [121]:
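model = gensim.models.Word2Vec(
    window=10,    # maximum distance between a target word and its context words
    min_count=2,  # ignore words that appear fewer than 2 times in the corpus
    workers=4     # number of worker threads used for training
)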
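
The `window` parameter controls which neighboring words count as context for a given word. The sketch below illustrates the idea with a fixed symmetric window over a toy sentence. It is only an illustration under simplifying assumptions: the helper `context_pairs` is hypothetical, and gensim's actual training differs in its details (for example, it may use a smaller effective window at some positions).

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs using a fixed symmetric window (hypothetical helper)."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # left edge, clipped at sentence start
        hi = min(len(tokens), i + window + 1)   # right edge, clipped at sentence end
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["payment", "history", "missing", "credit", "report"]
print(context_pairs(tokens, window=1))
print(len(context_pairs(tokens, window=10)))  # window wider than the sentence: every other word is context
```

With `window=10`, short narratives are covered entirely by the window, so every word in them acts as context for every other word.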

In [122]:
# Scan the tokenized narratives once to build the vocabulary
model.build_vocab(narrative_text, progress_per=1000)
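
Building the vocabulary amounts to counting every token in the corpus and keeping those that meet the `min_count` threshold. The sketch below shows that filtering step in plain Python on made-up sentences. The helper `build_vocab_sketch` is hypothetical and ignores everything else gensim does at this stage, such as preparing the model's internal structures.

```python
from collections import Counter

def build_vocab_sketch(sentences, min_count):
    """Count tokens and keep those seen at least min_count times (hypothetical helper)."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok: n for tok, n in counts.items() if n >= min_count}

# Made-up tokenized sentences.
sentences = [
    ["payment", "history", "missing"],
    ["payment", "late"],
    ["history", "report"],
]
vocab = build_vocab_sketch(sentences, min_count=2)
print(sorted(vocab))  # prints ['history', 'payment']
```

Words below the threshold get no vector, so looking them up in the trained model raises a `KeyError`.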

In [123]:
model.epochs

5

In [124]:
model.corpus_count

162411

In [125]:
model.train(narrative_text, total_examples = model.corpus_count, epochs=model.epochs)

(57563312, 64871100)
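
The tuple returned by `train` can be read as (effective word count, raw word count), where the effective count excludes words that were skipped, such as those outside the vocabulary. Assuming the raw count is summed over all 5 epochs, a quick calculation gives the corpus size per epoch and the share of tokens actually used for training:

```python
# Values copied from the training output above.
effective_words, raw_words = 57563312, 64871100
epochs = 5

tokens_per_epoch = raw_words // epochs       # tokens seen in one pass over the corpus
kept_fraction = effective_words / raw_words  # share of tokens that contributed to training

print(tokens_per_epoch)          # prints 12974220
print(round(kept_fraction, 3))   # prints 0.887
```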

In [127]:
model.wv.most_similar('bad')

[('good', 0.5826908349990845),
 ('ruined', 0.579811692237854),
 ('horrible', 0.5583643913269043),
 ('unnerving', 0.5000274181365967),
 ('fishy', 0.4982348680496216),
 ('big', 0.49119189381599426),
 ('poor', 0.4901562035083771),
 ('punished', 0.49004966020584106),
 ('sad', 0.48370835185050964),
 ('excellent', 0.4806368947029114)]

In [128]:
model.wv.similarity(w1='email',w2='letter')

0.50910693

In [136]:
model.wv.similarity(w1='cheap',w2='cheap')

1.0
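
The `similarity` score is the cosine similarity between the two word vectors, which is why comparing a word with itself gives 1.0. A minimal pure-Python sketch of the formula follows. The vectors are made-up three-dimensional examples, not values from the trained model, and the names `cheap` and `pricey` are only illustrative.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, not taken from the trained model.
cheap = [0.2, -0.5, 0.1]
pricey = [0.5, 0.2, 0.0]

print(cosine_similarity(cheap, cheap))   # a vector compared with itself: approximately 1.0
print(cosine_similarity(cheap, pricey))  # orthogonal toy vectors: approximately 0.0
```

Because the score depends only on the angle between vectors, words that occur in similar contexts score high even when their meanings are opposite, which is consistent with 'good' and 'excellent' appearing among the neighbors of 'bad' above.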