<a href="https://colab.research.google.com/github/Usamamalik11/Darth-Vader/blob/master/Text%20Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import pandas as pd
df = pd.read_csv(r'/content/drive/My Drive/Colab Notebooks/Consumer_Complaints.csv', nrows = 10000) ## Use larger dataset for more accurate results. I have only taken 10000 rows as I was running out of memory.
df.head()


In [0]:
## Cleaning and Preprocessing the data

from io import StringIO ## In case I want a file-like object that ACTS like a file, but is writing to an in-memory string buffer.
col = ['Product', 'Consumer complaint narrative']
df = df[col]
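df = df[pd.notnull(df['Consumer complaint narrative'])].copy()   ## drops rows with a missing narrative; .copy() avoids SettingWithCopyWarning on the assignments below
df.columns = ['Product', 'Consumer_complaint_narrative']   ## renaming the columns
df['category_id'] = df['Product'].factorize()[0]         ## integer code per product, numbered in order of first appearance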
category_id_df = df[['Product', 'category_id']].drop_duplicates().sort_values('category_id')   ## a smaller table containing categorical products
category_to_id = dict(category_id_df.values)    ## creates a dictionary with mapping from product to category_id
id_to_category = dict(category_id_df[['category_id', 'Product']].values) ## creates a dictionary with mapping from category_id to product
df.head()
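
To make the `factorize()` step and the two lookup dictionaries concrete, here is a minimal sketch on a made-up four-row table. The `toy_*` names and the product values are illustrative assumptions, not rows from the complaints file.

```python
import pandas as pd

# Hypothetical stand-in for the complaints table.
toy = pd.DataFrame({
    "Product": ["Mortgage", "Debt collection", "Mortgage", "Credit card"],
})

# factorize() returns (codes, uniques); codes are assigned in order of first appearance.
codes, uniques = toy["Product"].factorize()
toy["category_id"] = codes

# One row per product, then a mapping in each direction.
pairs = toy[["Product", "category_id"]].drop_duplicates().sort_values("category_id")
toy_category_to_id = {p: int(c) for p, c in zip(pairs["Product"], pairs["category_id"])}
toy_id_to_category = {c: p for p, c in toy_category_to_id.items()}

print(toy_category_to_id)
```

Repeated products share one id, so the mapping has one entry per distinct product rather than one per row.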

In [0]:
## Plotting the data
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,6))## width and height in inches
df.groupby('Product').Consumer_complaint_narrative.count().plot.bar(ylim=0)## counts cases of each product and plots them in bars
plt.show()

In [0]:
##Vectorizing the data
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')

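features = tfidf.fit_transform(df.Consumer_complaint_narrative).toarray()   ## .toarray() makes the matrix dense, which is memory hungry; the scikit-learn estimators used below also accept the sparse matrix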
labels = df.category_id
features.shape



*   sublinear_tf is set to True to use a logarithmic form for term frequency (1 + log(tf) instead of the raw count).
*   min_df is the minimum number of documents a term must appear in to be kept.
*   norm is set to l2 so that every feature vector has a Euclidean norm of 1.
*   ngram_range is set to (1, 2) so that both unigrams and bigrams are considered.
*   stop_words is set to "english" to remove common function words ("a", "the", ...) and reduce the number of noisy features.
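
The effect of these settings can be checked on a tiny corpus. The sketch below uses three made-up sentences and lowers `min_df` to 2 so that something survives the filter; both are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the complaints data.
toy_docs = [
    "the bank charged a late fee on my credit card",
    "the bank reported my credit card payment late",
    "my mortgage payment was applied late by the bank",
]

toy_tfidf = TfidfVectorizer(sublinear_tf=True, min_df=2, norm="l2",
                            ngram_range=(1, 2), stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_docs)   # scipy sparse matrix

vocab = toy_tfidf.vocabulary_                    # maps term -> column index
norms = np.linalg.norm(toy_matrix.toarray(), axis=1)

print(sorted(vocab))   # surviving unigrams and bigrams
print(norms)           # one Euclidean norm per document
```

The bigram "credit card" is kept because it occurs in two documents, "fee" is dropped by `min_df`, "the" is dropped as a stop word, and every row has unit length because of `norm="l2"`.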




### We can use sklearn.feature_selection.chi2 to find the terms that are the most correlated with each of the products.

In [0]:
from sklearn.feature_selection import chi2
import numpy as np

N = 2
for Product, category_id in sorted(category_to_id.items()):
  features_chi2 = chi2(features, labels == category_id)
  indices = np.argsort(features_chi2[0])
  feature_names = np.array(tfidf.get_feature_names_out())[indices]   ## get_feature_names() was removed in scikit-learn 1.2; this call needs scikit-learn >= 1.0
  unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
  bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
  print("# '{}':".format(Product))
  print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-N:])))
  print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-N:])))
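
As a small illustration of what `chi2` returns, the sketch below scores three made-up terms against a one-vs-rest target, the same way the loop above compares `labels == category_id`. The term names and counts are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical term counts: rows are documents, columns are terms.
toy_terms = np.array(["loan", "card", "bank"])
toy_X = np.array([
    [3, 0, 1],
    [2, 0, 1],
    [0, 3, 1],
    [0, 2, 1],
])
toy_is_loan = np.array([True, True, False, False])   # one-vs-rest target

# chi2 returns a pair of arrays: (statistics, p-values), one entry per column.
scores, p_values = chi2(toy_X, toy_is_loan)
ranked = toy_terms[np.argsort(scores)]   # ascending, so the strongest terms come last

print(dict(zip(toy_terms, scores)))
```

"bank" is spread evenly over both classes and scores 0. "loan" and "card" score equally, because the statistic measures dependence on the class rather than its direction: a term that only appears outside a category ranks as high as one that only appears inside it.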



*   To train supervised classifiers, we first transformed each “Consumer complaint narrative” into a vector of numbers, here a TF-IDF weighted vector.

*   With this vector representation of the text, we can train supervised classifiers and use them to predict the “Product” of an unseen “Consumer complaint narrative”.



### Naive Bayes Classifier:

In [0]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(df['Consumer_complaint_narrative'], df['Product'], random_state = 0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

clf = MultinomialNB().fit(X_train_tfidf, y_train)

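Let's make a prediction. Note that clf was fitted on TF-IDF features, while the cell below passes raw counts from count_vect.transform; for consistency the counts should also go through tfidf_transformer.transform before clf.predict.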

In [17]:
print(clf.predict(count_vect.transform(["I am disputing the inaccurate information the Chex-Systems has on my credit report. I initially submitted a police report on XXXX/XXXX/16 and Chex Systems only deleted the items that I mentioned in the letter and not all the items that were actually listed on the police report. In other words they wanted me to say word for word to them what items were fraudulent. The total disregard of the police report and what accounts that it states that are fraudulent. If they just had paid a little closer attention to the police report I would not been in this position now and they would n't have to research once again. I would like the reported information to be removed : XXXX XXXX XXXX"])))

['Credit reporting, credit repair services, or other personal consumer reports']
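
One way to guarantee that new text goes through exactly the same steps as the training text is to wrap the vectorizer and the classifier in a single scikit-learn pipeline. The sketch below is an assumption about how that could look, using four made-up complaints and shortened product names rather than the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training texts and labels.
toy_texts = [
    "wrong item on my credit report",
    "credit report shows an account that is not mine",
    "mortgage escrow payment was misapplied",
    "my mortgage servicer lost the escrow payment",
]
toy_products = ["Credit reporting", "Credit reporting", "Mortgage", "Mortgage"]

# TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer.
toy_clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_clf.fit(toy_texts, toy_products)

# predict() takes raw strings; the pipeline applies the fitted TF-IDF step itself.
prediction = toy_clf.predict(["there is an error on my credit report"])
print(prediction[0])
```

With a pipeline there is no separate transform call to forget at prediction time.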


In [18]:
df[df['Consumer_complaint_narrative'] == "I am disputing the inaccurate information the Chex-Systems has on my credit report. I initially submitted a police report on XXXX/XXXX/16 and Chex Systems only deleted the items that I mentioned in the letter and not all the items that were actually listed on the police report. In other words they wanted me to say word for word to them what items were fraudulent. The total disregard of the police report and what accounts that it states that are fraudulent. If they just had paid a little closer attention to the police report I would not been in this position now and they would n't have to research once again. I would like the reported information to be removed : XXXX XXXX XXXX"]

Unnamed: 0,Product,Consumer_complaint_narrative,category_id


We will evaluate the following four models:


*   Logistic Regression
*   (Multinomial) Naive Bayes
*   Linear Support Vector Machine
*   Random Forest





In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]
CV = 5
entries = []
for model in models:
  model_name = model.__class__.__name__
  accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
  for fold_idx, accuracy in enumerate(accuracies):
    entries.append((model_name, fold_idx, accuracy))

##creating a data frame for model name, fold number and corresponding accuracy
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])

## plotted in two different ways 
import seaborn as sns 
sns.boxplot(x='model_name', y='accuracy', data=cv_df)
sns.stripplot(x='model_name', y='accuracy', data=cv_df, 
              size=8, jitter=True, edgecolor="gray", linewidth=2)
plt.show()

In [20]:
cv_df.groupby('model_name').accuracy.mean() ## finding mean of accuracies of 5 folds

model_name
LinearSVC                 0.785254
LogisticRegression        0.733512
MultinomialNB             0.543499
RandomForestClassifier    0.480312
Name: accuracy, dtype: float64

LinearSVC has the highest mean accuracy across the five folds (about 0.79), so we use it for the rest of the analysis.
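
One caveat on this comparison: `features` was produced by fitting TF-IDF on every row before cross-validation, so the vocabulary and IDF weights in each fold have already seen the held-out text. A possible variant, sketched here on made-up data, cross-validates a pipeline on the raw strings so the vectorizer is refit inside each training fold. The texts, labels, and `cv=2` are assumptions sized for the toy example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical texts: four per class so that two stratified folds are possible.
toy_texts = [
    "error on my credit report",
    "credit report lists a wrong account",
    "credit bureau ignored my dispute",
    "dispute about my credit score",
    "mortgage payment was misapplied",
    "my mortgage escrow is wrong",
    "servicer lost my mortgage payment",
    "escrow shortage on my mortgage",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = make_pipeline(TfidfVectorizer(), LinearSVC())

# The whole pipeline is cloned and refit on each training fold.
scores = cross_val_score(pipe, toy_texts, toy_labels, scoring="accuracy", cv=2)
print(scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean also shows whether the gap between two models is larger than the fold-to-fold variation.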

In [0]:
model = LinearSVC()
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features, labels, df.index, test_size=0.33, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(conf_mat, annot=True, fmt='d',
            xticklabels=category_id_df.Product.values, yticklabels=category_id_df.Product.values)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
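
For reading the heatmap, it helps to remember that `confusion_matrix` puts actual classes on the rows and predicted classes on the columns. A minimal sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for six samples and three classes.
toy_actual    = [0, 0, 0, 1, 1, 2]
toy_predicted = [0, 0, 1, 1, 1, 0]

# toy_conf[i, j] counts samples whose actual class is i and predicted class is j.
toy_conf = confusion_matrix(toy_actual, toy_predicted)
print(toy_conf)
# [[2 1 0]
#  [0 2 0]
#  [1 0 0]]
```

Correct predictions sit on the diagonal; the off-diagonal entry `toy_conf[0, 1]` is the one class-0 sample that was predicted as class 1.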

Let's look at the examples behind the most frequent misclassifications (pairs of categories with at least 10 errors):

In [0]:
from IPython.display import display
for predicted in category_id_df.category_id:
  for actual in category_id_df.category_id:
    if predicted != actual and conf_mat[actual, predicted] >= 10:
      print("'{}' predicted as '{}' : {} examples.".format(id_to_category[actual], id_to_category[predicted], conf_mat[actual, predicted]))
      display(df.loc[indices_test[(y_test == actual) & (y_pred == predicted)]][['Product', 'Consumer_complaint_narrative']])
      print('')


Next, we refit the model on all the data and use the learned LinearSVC coefficients (rather than the chi-squared test used earlier) to find the terms that weigh most heavily toward each category:

In [0]:
model.fit(features, labels)
N = 2
for Product, category_id in sorted(category_to_id.items()):
  indices = np.argsort(model.coef_[category_id])
  feature_names = np.array(tfidf.get_feature_names_out())[indices]   ## get_feature_names() was removed in scikit-learn 1.2; this call needs scikit-learn >= 1.0
  unigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 1][:N]
  bigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 2][:N]
  print("# '{}':".format(Product))
  print("  . Top unigrams:\n       . {}".format('\n       . '.join(unigrams)))
  print("  . Top bigrams:\n       . {}".format('\n       . '.join(bigrams)))

Finally, we print out the classification report for each class:

In [0]:
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred, target_names=df['Product'].unique()))## target names here are just labels
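
Passing only `target_names` relies on every product appearing in the test split; if a rare class is missing, scikit-learn reports a mismatch between the number of classes and the number of names. A more defensive variant, sketched below with a made-up id-to-name mapping, passes the class ids explicitly through `labels`.

```python
from sklearn.metrics import classification_report

# Hypothetical mapping and labels; class 2 never occurs in this toy test split.
toy_id_to_category = {0: "Mortgage", 1: "Debt collection", 2: "Credit card"}
toy_actual    = [0, 0, 1, 1, 1]
toy_predicted = [0, 1, 1, 1, 1]

ids = sorted(toy_id_to_category)
report = classification_report(
    toy_actual,
    toy_predicted,
    labels=ids,                                          # fixes the row order
    target_names=[toy_id_to_category[i] for i in ids],   # names aligned with ids
    zero_division=0,                                     # report 0 for empty classes without warnings
)
print(report)
```

Each row of the report then lines up with its id even when a class has no test samples.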
