# Capstone 3 - Data Collection

### Project Background
    
    Customer satisfaction at financial institutions is taken very seriously but data may not always be leveraged as well as it can be. I remember when I supported a call center function at a large bank, the leaders of the call center would personally read individual complaints to try and glean valuable information and take action accordingly. Using NLP to analyze the type and severity of complaints can uncover serious issues with certain business functions that are negatively impacting customer satisfaction.
    
### Data

    The data I used in this project came from the Consumer Financial Protection Bureau complaints dataset consisting of 1.8 million complaints, of which about 600k had written complaints. 
    
### Machine Learning Methods

    I plan to use Natural Language Processing to classify the product alignment of each complaint as well as the severity. This will not only give me information on what business function is driving the highest number of complaints, but will also direct me to which ones are most serious and where I must first direct my attention.

In [82]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import os
from time import time
from nltk import sent_tokenize, word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.util import ngrams
from textblob import TextBlob
from gensim.corpora.dictionary import Dictionary
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
import spacy
# download spacy english word vectors
nlp = spacy.load('en_core_web_lg')


In [40]:
# exploratory
doc = nlp("We saw how to use different pre-built libraries for sentiment analysis using an unsupervised approach. We tried describing all the three packages in discussion i.e. IBM Watson Sentiment Analyzer, TextBlob & NLTK VADER-based sentiment analyzer. Also, we tried to explain how to use these successfully in Python.")
print(doc.ents[1], doc.ents[1].label_)

IBM ORG


In [86]:
print(os.getcwd())
data_path = "/Users/joeboardman/Downloads/"
save_path = "/Users/joeboardman/Documents/Springboard/GitHub/CapstoneThree/"
os.chdir(data_path)

/Users/joeboardman/Downloads


In [42]:
df = pd.read_csv(data_path + 'Consumer_Complaints.csv', nrows=50000) # smaller dataset - ~900k rows - 
#df = pd.read_csv(data_path + 'complaints.csv') # larger dataset - ~ 1.8m rows
df.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,3/12/2014,Mortgage,Other mortgage,"Loan modification,collection,foreclosure",,,,M&T BANK CORPORATION,MI,48382,,,Referral,3/17/2014,Closed with explanation,Yes,No,759217
1,10/1/2016,Credit reporting,,Incorrect information on credit report,Account status,I have outdated information on my credit repor...,Company has responded to the consumer and the ...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",AL,352XX,,Consent provided,Web,10/5/2016,Closed with explanation,Yes,No,2141773
2,10/17/2016,Consumer Loan,Vehicle loan,Managing the loan or lease,,I purchased a new car on XXXX XXXX. The car de...,,"CITIZENS FINANCIAL GROUP, INC.",PA,177XX,Older American,Consent provided,Web,10/20/2016,Closed with explanation,Yes,No,2163100
3,6/8/2014,Credit card,,Bankruptcy,,,,AMERICAN EXPRESS COMPANY,ID,83854,Older American,,Web,6/10/2014,Closed with explanation,Yes,Yes,885638
4,9/13/2014,Debt collection,Credit card,Communication tactics,Frequent or repeated calls,,,"CITIBANK, N.A.",VA,23233,,,Web,9/13/2014,Closed with explanation,Yes,Yes,1027760


In [43]:
df['Date received'].max()

'9/9/2016'

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Date received                 50000 non-null  object
 1   Product                       50000 non-null  object
 2   Sub-product                   35062 non-null  object
 3   Issue                         50000 non-null  object
 4   Sub-issue                     20963 non-null  object
 5   Consumer complaint narrative  10541 non-null  object
 6   Company public response       13462 non-null  object
 7   Company                       50000 non-null  object
 8   State                         49588 non-null  object
 9   ZIP code                      49586 non-null  object
 10  Tags                          6858 non-null   object
 11  Consumer consent provided?    20011 non-null  object
 12  Submitted via                 50000 non-null  object
 13  Date sent to com

In [45]:
# drop needless columns
df.drop(['Date received', 'Sub-product', 'Sub-issue', 'ZIP code', 'Tags', 'Consumer consent provided?',
        'Submitted via', 'Date sent to company', 'Timely response?', "Company public response"], 
        axis=1, inplace=True)

In [46]:
# rename columns to remove spaces
df = df.rename(columns={"Consumer complaint narrative": "complaint_text", 
                        "Company response to consumer":"company_response",
                        "Consumer disputed?":"disputed", 
                        "Complaint ID":"complaint_ID"})


In [47]:
# drop rows without a written complaint
df = df.dropna(subset=['complaint_text']).reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10541 entries, 0 to 10540
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Product           10541 non-null  object
 1   Issue             10541 non-null  object
 2   complaint_text    10541 non-null  object
 3   Company           10541 non-null  object
 4   State             10516 non-null  object
 5   company_response  10541 non-null  object
 6   disputed          10537 non-null  object
 7   complaint_ID      10541 non-null  int64 
dtypes: int64(1), object(7)
memory usage: 658.9+ KB


In [48]:
df[df.disputed == 'Yes'].iloc[2,2]

'I have disputed several accounts on my credit report with Equifax several times. I first disputed with the original creditors, XXXX, XXXX XXXX, and XXXX XXXX Solutions and after 3 letters did not receive any correspondence with proof. I sent copies of the letters and certified mail receipts to the credit bureau but they continue to verify these debts without copies of any proof with my signature. A statement of dispute was also never put on my credit report. Equifax only continues to update the accounts associated with these creditors.'

In [49]:
# different products - notice duplication (i.e. 'payday loan' and 'payday loan, title loan, or personal loan')
df['Product'].value_counts()

Debt collection                                                                 2558
Mortgage                                                                        2094
Credit reporting                                                                2028
Credit card                                                                     1231
Bank account or service                                                          938
Student loan                                                                     759
Consumer Loan                                                                    602
Payday loan                                                                      110
Money transfers                                                                  103
Prepaid card                                                                      90
Other financial service                                                           25
Virtual currency                                                 

In [50]:
# consolidate duplicative products
df.loc[df.Product == 'Virtual currency', ['Product']] = 'Consumer Loan'
df.loc[df.Product == 'Mortgage', ['Product']] = 'Consumer Loan'
df.loc[df.Product == 'Student Loan', ['Product']] = 'Consumer Loan'
df.loc[df.Product == 'Payday loan, title loan, or personal loan', ['Product']] = 'Consumer Loan'
df.loc[df.Product == 'Payday loan', ['Product']] = 'Consumer Loan'
df.loc[df.Product == 'Vehicle loan or lease', ['Product']] = 'Consumer Loan'
df.loc[df.Product == 'Student loan', ['Product']] = 'Consumer Loan'
df.loc[df.Product == 'Prepaid card', ['Product']] = 'Credit card or prepaid card'
df.loc[df.Product == 'Money transfers', ['Product']] = 'Other financial service'
df.loc[df.Product == 'Credit reporting, credit repair services, or other personal consumer reports', ['Product']] = 'Credit reporting'
df.loc[df.Product == 'Credit card', ['Product']] = 'Credit card or prepaid card'
df.loc[df.Product == 'Checking or savings account', ['Product']] = 'Bank account or service'
df.loc[df.Product == 'Money transfer, virtual currency, or money service', ['Product']] = 'Other financial service'

df.Product.value_counts()

Consumer Loan                  3567
Debt collection                2558
Credit reporting               2029
Credit card or prepaid card    1321
Bank account or service         938
Other financial service         128
Name: Product, dtype: int64

In [51]:
# check for duplicates
duplicateRowsDF = df[df.duplicated()]
duplicateRowsDF
# drop duplicates
#df.drop_duplicates(keep='first', inplace=True)

Unnamed: 0,Product,Issue,complaint_text,Company,State,company_response,disputed,complaint_ID


In [81]:
# change 'disputed' from yes/no to 1/0
pd.Series(np.where(df.disputed.values == 'yes', 1, 0), df.index)
df.head()

Unnamed: 0,Product,Issue,complaint_text,Company,State,company_response,disputed,complaint_ID,sentences,words,special_chars,preprocessed_complaint,polarity
0,Credit reporting,Incorrect information on credit report,I have outdated information on my credit repor...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",AL,Closed with explanation,No,2141773,1,34,0,"[outdated, information, credit, report, previo...",-0.155556
1,Consumer Loan,Managing the loan or lease,I purchased a new car on XXXX XXXX. The car de...,"CITIZENS FINANCIAL GROUP, INC.",PA,Closed with explanation,No,2163100,22,424,0,"[purchase, new, car, xxxx, xxxx, car, dealer, ...",0.076033
2,Credit reporting,Credit reporting company's investigation,An account on my credit report has a mistaken ...,Experian Information Solutions Inc.,VA,Closed with explanation,No,1420702,11,232,0,"[account, credit, report, mistake, date, mail,...",0.15625
3,Debt collection,Disclosure verification of debt,This company refuses to provide me verificatio...,"The CBE Group, Inc.",TX,Closed with explanation,Yes,1772196,2,27,0,"[company, refuse, provide, verification, valid...",0.285714
4,Debt collection,Improper contact or sharing of info,This complaint is in regards to Square Two Fin...,SQUARETWO FINANCIAL CORPORATION,NE,Closed with explanation,Yes,1790634,17,337,0,"[complaint, regard, square, two, financial, re...",-0.016548


In [52]:
# Feature Engineering - Create column for # of sentences in complaint

sent_tokens = [sent_tokenize(item) for item in df.complaint_text]
len_sent_tokens = []

for i in range(len(sent_tokens)):
    len_sent_tokens.append(len(sent_tokens[i]))

df['sentences'] = len_sent_tokens


In [53]:
# Feature Engineering - Create column for # of words in complaint

word_tokens = [word_tokenize(item) for item in df.complaint_text]
len_word_tokens = []

for i in range(len(word_tokens)):
    len_word_tokens.append(len(word_tokens[i]))

df['words'] = len_word_tokens


In [54]:
# Feature Engineering - Create column for # of ? and ! in complaint

df['special_chars'] = df.complaint_text.str.count('\!') + df.complaint_text.str.count('\?')
df.head()


Unnamed: 0,Product,Issue,complaint_text,Company,State,company_response,disputed,complaint_ID,sentences,words,special_chars
0,Credit reporting,Incorrect information on credit report,I have outdated information on my credit repor...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",AL,Closed with explanation,No,2141773,1,34,0
1,Consumer Loan,Managing the loan or lease,I purchased a new car on XXXX XXXX. The car de...,"CITIZENS FINANCIAL GROUP, INC.",PA,Closed with explanation,No,2163100,22,424,0
2,Credit reporting,Credit reporting company's investigation,An account on my credit report has a mistaken ...,Experian Information Solutions Inc.,VA,Closed with explanation,No,1420702,11,232,0
3,Debt collection,Disclosure verification of debt,This company refuses to provide me verificatio...,"The CBE Group, Inc.",TX,Closed with explanation,Yes,1772196,2,27,0
4,Debt collection,Improper contact or sharing of info,This complaint is in regards to Square Two Fin...,SQUARETWO FINANCIAL CORPORATION,NE,Closed with explanation,Yes,1790634,17,337,0


In [55]:
round(df.describe().T,3)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
complaint_ID,10541.0,1930459.302,344415.478,1290516.0,1629406.0,1959481.0,2242317.0,2472897.0
sentences,10541.0,9.36,8.241,1.0,4.0,7.0,12.0,72.0
words,10541.0,207.878,183.814,2.0,77.0,146.0,277.0,1010.0
special_chars,10541.0,0.52,2.029,0.0,0.0,0.0,0.0,75.0


In [56]:
# words to remove from stop words
stop_remove = ['no', 'not', "don't"]

def remove_stop_words(list):
    '''removing multiple stop words instead of repetitive code'''
    
    global STOP_WORDS
    STOP_WORDS = stopwords.words('english')
    for i in range(len(list)):
        STOP_WORDS.remove(list[i])
    #return STOP_WORDS
    
# remove negation stop words
remove_stop_words(stop_remove)


In [57]:
# text preprocessing formulas
def preprocess_text(text):
    '''tokenize, lemmatize and remove stop words from text'''
    
    # Tokenise words while ignoring punctuation
    tokeniser = RegexpTokenizer(r'\w+')
    tokens = tokeniser.tokenize(text)
    bigram = list(ngrams(tokens, 2)) 
    
    # Lowercase and lemmatise 
    lemmatiser = WordNetLemmatizer()
    lemmas = [lemmatiser.lemmatize(token.lower(), pos='v') for token in tokens]
    
    # Remove stopwords
    keywords= [lemma for lemma in lemmas if lemma not in STOP_WORDS]
    return keywords

In [58]:
# Get the polarity score of each complaint
def get_TextBlob_score(text):
    '''Assign polarity score between -1 to 1 for text'''
    
    polarity = TextBlob(text).sentiment.polarity
    return polarity

In [64]:
# Preprocess complaint_text column then run the textblob polarity 
df['preprocessed_complaint'] = df.complaint_text.apply(preprocess_text)

pol = [get_TextBlob_score(str(complaint)) for complaint in df.preprocessed_complaint]
    
df['polarity'] = pol


In [66]:
print(df.polarity.nsmallest())
print(df.polarity.nlargest())

1764    -1.0
6014    -1.0
10266   -1.0
1123    -0.8
2531    -0.8
Name: polarity, dtype: float64
3542    1.0
8538    1.0
5278    0.8
8823    0.8
9632    0.8
Name: polarity, dtype: float64


In [73]:
print('Most positive review: ', df.iloc[3542,2])
print()
print('Most Negative Review: ', df.iloc[1764,2])

Most positive review:  Hello, I am disputing the charge of {$1000.00} charged on XXXX XXXX, 2014 by merchant XXXX XXXX with my bank ( Bancorp South ). For detailed information please refer to the attached document. 
Best Regards, XXXX XXXX

Most Negative Review:  XX/XX/2016 I received a call from Discover stating they were calling because someone was attempting to use card, I then informed them it was not us. I was under impression that they would not allow transactions to go through. Well XX/XX/XXXX I receive an outrageous bill, they allowed the transactions to go through. I filed dispute end result was that signatures match so they determined it was n't fraud. It is in fact fraud, I asked them to show me footage of security camera and that I had told them not to allow transactions because we were both at work, not out shopping.


In [71]:
df.head()

Unnamed: 0,Product,Issue,complaint_text,Company,State,company_response,disputed,complaint_ID,sentences,words,special_chars,preprocessed_complaint,polarity
0,Credit reporting,Incorrect information on credit report,I have outdated information on my credit repor...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",AL,Closed with explanation,No,2141773,1,34,0,"[outdated, information, credit, report, previo...",-0.155556
1,Consumer Loan,Managing the loan or lease,I purchased a new car on XXXX XXXX. The car de...,"CITIZENS FINANCIAL GROUP, INC.",PA,Closed with explanation,No,2163100,22,424,0,"[purchase, new, car, xxxx, xxxx, car, dealer, ...",0.076033
2,Credit reporting,Credit reporting company's investigation,An account on my credit report has a mistaken ...,Experian Information Solutions Inc.,VA,Closed with explanation,No,1420702,11,232,0,"[account, credit, report, mistake, date, mail,...",0.15625
3,Debt collection,Disclosure verification of debt,This company refuses to provide me verificatio...,"The CBE Group, Inc.",TX,Closed with explanation,Yes,1772196,2,27,0,"[company, refuse, provide, verification, valid...",0.285714
4,Debt collection,Improper contact or sharing of info,This complaint is in regards to Square Two Fin...,SQUARETWO FINANCIAL CORPORATION,NE,Closed with explanation,Yes,1790634,17,337,0,"[complaint, regard, square, two, financial, re...",-0.016548


In [91]:
df.Issue.value_counts()

Incorrect information on credit report      1358
Cont'd attempts collect debt not owed       1072
Loan servicing, payments, escrow account     973
Loan modification,collection,foreclosure     653
Dealing with my lender or servicer           541
                                            ... 
Attempts to collect debt not owed              1
Overlimit fee                                  1
Struggling to pay mortgage                     1
Shopping for a line of credit                  1
Lost or stolen money order                     1
Name: Issue, Length: 86, dtype: int64

In [89]:
# Save DataFrame as csv to be used for EDA and Modeling
save = os.path.join(save_path, 'data/processed', 'DW_complaints')
df.to_csv(save)


In [74]:
# Split data between train and test set

X = df.drop('Product', axis=1)
y = df.Product

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=34)


# Create Tf-Idf

- Using the sklearn package, I have created a Tf-Idf of the top 1000 words, bi-grams and tri-grams in the dataset
- I fit and transformed on the training set and transformed the test set using the same vectorizer to keep it consistent between the two
- I converted the Tf-Idfs into dataframes to be used to create a classifier to predict the product alignment of each review

In [75]:
# Create tfidf matrix with the complaint text column using standard preprocessing and bi-grams
start_time = time.time()

tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=1000, stop_words=STOP_WORDS)
tfidf_train = tfidf_vectorizer.fit_transform(X_train.complaint_text)
tfidf_test = tfidf_vectorizer.transform(X_test.complaint_text)

# convert tfidf to dataframe 
tfidf_train = pd.DataFrame(tfidf_train.A, columns=tfidf_vectorizer.get_feature_names())
tfidf_test = pd.DataFrame(tfidf_test.A, columns=tfidf_vectorizer.get_feature_names())

end_time = time.time()
print(end_time - start_time)

9.364492177963257


In [76]:
print(tfidf_train.head())
print(tfidf_train.shape)

    00  00 month  00 per  00 xxxx  00 xxxx xxxx   10  100  100 00  1000  \
0  0.0       0.0     0.0      0.0           0.0  0.0  0.0     0.0   0.0   
1  0.0       0.0     0.0      0.0           0.0  0.0  0.0     0.0   0.0   
2  0.0       0.0     0.0      0.0           0.0  0.0  0.0     0.0   0.0   
3  0.0       0.0     0.0      0.0           0.0  0.0  0.0     0.0   0.0   
4  0.0       0.0     0.0      0.0           0.0  0.0  0.0     0.0   0.0   

   1000 00  ...  xxxx xxxx credit  xxxx xxxx not  xxxx xxxx received  \
0      0.0  ...               0.0            0.0                 0.0   
1      0.0  ...               0.0            0.0                 0.0   
2      0.0  ...               0.0            0.0                 0.0   
3      0.0  ...               0.0            0.0                 0.0   
4      0.0  ...               0.0            0.0                 0.0   

   xxxx xxxx xx  xxxx xxxx xxxx  xxxx years  year  years  years ago  yet  
0           0.0             0.0         0

In [90]:
# save train set
save = os.path.join(save_path, 'data/processed', 'tfidf_train')
tfidf_train.to_csv(save)

# save test set
save = os.path.join(save_path, 'data/processed', 'tfidf_test')
tfidf_test.to_csv(save)

# Steps - 11/15/20:

1. Create Tf-Idf vectorizer
2. Create classifier for Product
3. Feature Engineering prior to prediction of severity sentiment
4. Sentiment analysis to preduct sentiment
    1. TextBlob for polarity
        1. Preprocess the complaint before TextBlob?
    2. word2vec for word similarities
    3. 

### Thoughts:
- can use max_features argument of CountVectorizer and TfidfVectorizer to only include X amount of features
- Bigrams/Trigrams (ngram_range parameter in TfidfVectorizer)
    - can do a gridsearch to find whether bigrams or trigrams are better
- dimensionality reduction
    - find optimal number of features
- train and run classifiers for product alignment of complaint - naive bayes? linear models?
- figure out how to predict severity
    - TextBlob or IBM Watson sentiment analysis to get polarity of complaint
    - creating other features using text column
        - number of punctuation marks (indicator of emotionally charged statement)
        - number of words, sentences in each complaint
    - consumer disputed column?
        - discontinued 4/23/17 - latest date in data is 9/9/17
        - could be indicator or the severity of complaint
- recommend next action based on product alignment and severity

### Ideas:
- highly sensitive products?
- pipeline
    - gridsearch on pipeline?
- spaCy
    - word2vec NN



# Modeling - Predict Product Alignment

### Create a classifier that will predict which product the complaint is aligned to 
- This will allow the end user of this model to be able to identify areas with a higher number of complaints with respect to others

In [77]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
rf_classifier = RandomForestClassifier()
rf_classifier.fit(tfidf_train, y_train)
rf_predictions = rf_classifier.predict(tfidf_test)


In [78]:
print(classification_report(y_test, rf_predictions, zero_division=0))

                             precision    recall  f1-score   support

    Bank account or service       0.79      0.74      0.77       184
              Consumer Loan       0.83      0.93      0.87       707
Credit card or prepaid card       0.81      0.72      0.76       279
           Credit reporting       0.83      0.80      0.81       403
            Debt collection       0.81      0.81      0.81       511
    Other financial service       0.00      0.00      0.00        25

                   accuracy                           0.82      2109
                  macro avg       0.68      0.67      0.67      2109
               weighted avg       0.81      0.82      0.81      2109



In [18]:
# Other tfidf attempts using formulas for preprocessing

'''# Create tfidf matrix with the complaint text column using preprocess_text formula as the analyzer parameter
start_time = time.time()

tfidf_vectorizer = TfidfVectorizer(analyzer=preprocess_text, lowercase=False)
tfidf_train = tfidf_vectorizer.fit_transform(X_train.complaint_text)
tfidf_test = tfidf_vectorizer.transform(X_test.complaint_text)

# convert tfidf to dataframe 
df_tfidf = pd.DataFrame(tfidf_train.A, columns=tfidf_vectorizer.get_feature_names())

end_time = time.time()
print(end_time - start_time)

# Create tfidf matrix with the complaint text column using preprocess_text formula as the analyzer parameter
start_time = time.time()

tfidf_vectorizer = TfidfVectorizer(analyzer=preprocess_text, lowercase=False, ngram_range=(1, 2), 
                                   stop_words=STOP_WORDS)
tfidf_train = tfidf_vectorizer.fit_transform(X_train.complaint_text)
tfidf_test = tfidf_vectorizer.transform(X_test.complaint_text)

# convert tfidf to dataframe 
df_tfidf = pd.DataFrame(tfidf_train.A, columns=tfidf_vectorizer.get_feature_names())

end_time = time.time()
print(end_time - start_time)'''

'# Create tfidf matrix with the complaint text column using preprocess_text formula as the analyzer parameter\nstart_time = time.time()\n\ntfidf_vectorizer = TfidfVectorizer(analyzer=preprocess_text, lowercase=False)\ntfidf_train = tfidf_vectorizer.fit_transform(X_train.complaint_text)\ntfidf_test = tfidf_vectorizer.transform(X_test.complaint_text)\n\n# convert tfidf to dataframe \ndf_tfidf = pd.DataFrame(tfidf_train.A, columns=tfidf_vectorizer.get_feature_names())\n\nend_time = time.time()\nprint(end_time - start_time)\n\n# Create tfidf matrix with the complaint text column using preprocess_text formula as the analyzer parameter\nstart_time = time.time()\n\ntfidf_vectorizer = TfidfVectorizer(analyzer=preprocess_text, lowercase=False, ngram_range=(1, 2), \n                                   stop_words=STOP_WORDS)\ntfidf_train = tfidf_vectorizer.fit_transform(X_train.complaint_text)\ntfidf_test = tfidf_vectorizer.transform(X_test.complaint_text)\n\n# convert tfidf to dataframe \ndf_tfid

In [None]:
'''# create count vectorizer matrix
count_vectorizer = CountVectorizer(analyzer=preprocess_text, lowercase=False)
count_vec = count_vectorizer.fit_transform(corpus)

# convert tfidf to dataframe 
count_vec = tfidf.todense()
count_vec = pd.DataFrame(count_vec, columns=count_vectorizer.get_feature_names())'''

In [None]:
# Attempts at concatenating tfidf DF with original data df
'''
Attempt 1:
df_tfidf = pd.concat([df, tfidf], axis=1)
print(df_tfidf.info())
print(df_tfidf.head())

Attempt 2:
df_tfidf = df.merge(tfidf, left_index=True, right_index=True)
df_tfidf

Attempt 3:
indexes = np.linspace(0, len(df), num=10, dtype=np.int32)

df_tfidf = pd.DataFrame()
for i in range(len(indexes) - 1):
    df_tfidf = df_tfidf.append(pd.concat([df.loc[indexes[i]:indexes[i+1], :], 
                                        tfidf.loc[indexes[i]:indexes[i+1], :]],
                                       axis=1))
'''

In [16]:
# preprocess complaint_text column - 4 min runtime with large dataset, 3 with small dataset
#df.complaint_text = df.complaint_text.apply(preprocess_text)
#df.complaint_text