# Sprint 1: Data analysis and supervised learning

## Notes on the dataset

* Complaint volume must be related to the size and market share of the company.
A company with a large number of customers will, for example, receive more complaints than one with a small number of customers.
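
One possible way to make complaint volumes comparable is to divide them by the number of customers. The sketch below assumes customer counts are available from some external source; the company names and all figures are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical figures, for illustration only (not taken from the dataset).
companies = pd.DataFrame({
    "company": ["Large Bank", "Small Lender"],
    "complaints": [500, 40],
    "customers": [1_000_000, 20_000],
})

# Complaints per 1,000 customers makes companies of different sizes comparable.
companies["complaints_per_1000"] = (
    companies["complaints"] * 1000 / companies["customers"]
)
print(companies)
```

In this made-up example the larger company has more complaints in absolute terms, but the smaller one has the higher rate per customer.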

## Reading the dataset

In [11]:
import pandas as pd
import csv

def read_all_complaints():
    # Map each complaint's line number to its text: the Product, Sub-product,
    # Issue and Sub-issue columns (row[1:5]) joined by spaces.
    all_complaints = {}
    with open('data/complaints-100.csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        line_count = 0
        for row in csv_reader:
            if line_count == 0:
                # The first row is the header: list the available columns.
                print("All the information stored about a complaint:")
                for category in row:
                    print("-" + category)
                line_count += 1
            else:
                complaint = " ".join(row[1:5])
                all_complaints[line_count] = complaint
                line_count += 1
        print(f'File processed of {line_count} lines.')
    # The with-block closes the file, so no explicit close() is needed.
    return all_complaints
    
complaints_dict = read_all_complaints()


df = pd.DataFrame(list(complaints_dict.items()),columns=["Complaint Index","Complaint"])
print("Number of complaints in dataset: ",df.shape[0])
print(df.head(10))

All the information stored about a complaint:
-Date received
-Product
-Sub-product
-Issue
-Sub-issue
-Consumer complaint narrative
-Company public response
-Company
-State
-ZIP code
-Tags
-Consumer consent provided?
-Submitted via
-Date sent to company
-Company response to consumer
-Timely response?
-Consumer disputed?
-Complaint ID
File processed of 120 lines.
Number of complaints in dataset:  119
   Complaint Index                                          Complaint
0                1  Credit reporting, credit repair services, or o...
1                2  Debt collection I do not know False statements...
2                3  Debt collection I do not know Attempts to coll...
3                4  Debt collection Other debt Attempts to collect...
4                5  Credit reporting, credit repair services, or o...
5                6  Debt collection Mortgage debt Attempts to coll...
6                7  Credit reporting, credit repair services, or o...
7                8  Credit reporting, 
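
As an alternative to the manual `csv.reader` loop, the same complaint text could probably be built directly with pandas. This is only a sketch: it reads a tiny in-memory stand-in instead of `data/complaints-100.csv`, the two rows are made up, and the names `sample_csv`, `text_columns` and `sample_df` are introduced here for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for data/complaints-100.csv; both rows are made up.
sample_csv = io.StringIO(
    "Date received,Product,Sub-product,Issue,Sub-issue\n"
    "2019-01-01,Debt collection,Other debt,Attempts to collect debt not owed,Debt is not yours\n"
    "2019-01-02,Mortgage,,Trouble during payment process,\n"
)

# The same four columns that row[1:5] selects in the csv.reader version.
text_columns = ["Product", "Sub-product", "Issue", "Sub-issue"]
sample_df = pd.read_csv(sample_csv, usecols=text_columns)

# Join the four columns with spaces, treating missing values as empty strings.
sample_df["Complaint"] = sample_df[text_columns].fillna("").agg(" ".join, axis=1)
print(sample_df["Complaint"].tolist())
```

Empty fields become empty strings before joining, which mirrors what `" ".join(row[1:5])` does with empty CSV fields.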

## Determining the number of products and sub-products

In [None]:
def give_product_complaint_amount():
    # Count the complaints per product.
    # Assumes `complaints` is a list of dicts with a "product" key,
    # defined elsewhere (it is not created in the cells shown here).
    products = {}
    for complaint in complaints:
        product = complaint["product"]
        # Default to 0 and add 1, so the first occurrence counts as 1.
        products[product] = products.get(product, 0) + 1
    return pd.DataFrame(list(products.items()), columns=["Product", "Amount"])

In [None]:
def give_subproduct_complaint_amount():
    # Count the complaints per sub-product.
    # Assumes `complaints` is a list of dicts with a "sub-product" key,
    # defined elsewhere (it is not created in the cells shown here).
    sub_products = {}
    for complaint in complaints:
        sub_product = complaint["sub-product"]
        # Default to 0 and add 1, so the first occurrence counts as 1.
        sub_products[sub_product] = sub_products.get(sub_product, 0) + 1
    return pd.DataFrame(list(sub_products.items()), columns=["Sub-Product", "Amount"])

In [None]:
print(give_product_complaint_amount())
print(give_subproduct_complaint_amount())
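
The counting done by the two functions above could also be expressed with `collections.Counter`. The sketch below is a possible variant, not the notebook's own method: `sample_complaints` is a hypothetical stand-in, since the real `complaints` list is not shown in these cells.

```python
from collections import Counter
import pandas as pd

# Hypothetical records; the real `complaints` list is not shown in this notebook.
sample_complaints = [
    {"product": "Debt collection", "sub-product": "Other debt"},
    {"product": "Debt collection", "sub-product": "Mortgage debt"},
    {"product": "Mortgage", "sub-product": "FHA mortgage"},
]

# Counter tallies each product; most_common() returns (value, count) pairs
# sorted from most to least frequent.
product_counts = Counter(c["product"] for c in sample_complaints)
product_df = pd.DataFrame(product_counts.most_common(), columns=["Product", "Amount"])
print(product_df)
```

The same pattern applies to sub-products by reading the `"sub-product"` key instead.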

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
count_vect = CountVectorizer(
    stop_words="english",
    ngram_range=(1,2),
    min_df=2,
    max_df=0.5
)
X_train_counts = count_vect.fit_transform(df["Complaint"])

tf_transformer = TfidfTransformer()
X_train_tf = tf_transformer.fit_transform(X_train_counts)
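
The two-step `CountVectorizer` plus `TfidfTransformer` pipeline above can, as far as the scikit-learn documentation describes it, be collapsed into a single `TfidfVectorizer`. The sketch below checks this on a made-up mini corpus; `docs` and `options` are illustrative names, and `min_df`/`max_df` are left out because they would prune almost everything in such a small corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up mini corpus, for illustration only.
docs = [
    "credit monitoring service problem",
    "debt collection identity theft",
    "credit reporting identity theft",
]
options = {"stop_words": "english", "ngram_range": (1, 2)}

# Two steps: raw counts first, then tf-idf weighting.
counts = CountVectorizer(**options).fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer with the same options.
one_step = TfidfVectorizer(**options).fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

Keeping the two steps separate, as the notebook does, still has a use: the raw counts in `X_train_counts` remain available for inspection.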

In [13]:
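# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out()
# returns a NumPy array, so convert it to a plain list of strings.
feature_names = count_vect.get_feature_names_out().tolist()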
print(feature_names[:10])

print(len(feature_names))
print(X_train_tf.shape)

['account', 'account checking', 'account information', 'account problem', 'account status', 'advertised', 'advertising', 'advertising marketing', 'alerts', 'alerts security']
296
(119, 296)


In [15]:
print(df.at[0, "Complaint"])
features = X_train_tf[0]
terms = pd.DataFrame(features.T.todense(), index=feature_names, columns=["tfidf"])
terms.sort_values(by=["tfidf"],ascending=False).head(n=10)

Credit reporting, credit repair services, or other personal consumer reports Credit reporting Credit monitoring or identity theft protection services Problem canceling credit monitoring or identify theft protection service


| | tfidf |
|---|---|
| credit monitoring | 0.351708 |
| theft protection | 0.351708 |
| protection | 0.351708 |
| monitoring | 0.351708 |
| theft | 0.286905 |
| identify theft | 0.187349 |
| identify | 0.187349 |
| problem canceling | 0.187349 |
| monitoring identify | 0.187349 |
| canceling credit | 0.187349 |
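
To see where weights like these come from, the sketch below recomputes tf-idf by hand and compares it with `TfidfTransformer` under its default settings (smoothed idf, L2 normalisation). The count matrix is made up and the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up term counts: 3 documents (rows) x 3 terms (columns).
counts = np.array([
    [2, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
])

n_docs = counts.shape[0]
# Number of documents in which each term occurs.
doc_freq = (counts > 0).sum(axis=0)
# Smoothed inverse document frequency, the scikit-learn default.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
# Weight the raw counts, then scale every row to unit (L2) length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfTransformer().fit_transform(counts).toarray()
matches = np.allclose(manual, reference)
print(matches)
```

Terms that occur in few documents get a larger idf, and the row normalisation means each complaint vector has length 1.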


In [16]:
from sklearn.neighbors import NearestNeighbors

knn = NearestNeighbors(n_neighbors=5, metric="euclidean")
knn.fit(X_train_tf.toarray())
distances, neighbors = knn.kneighbors(features.toarray())

for distance, neighbor in zip(distances[0], neighbors[0]):
    print(distance, df.at[neighbor, 'Complaint'], sep=":\t")

0.0:	Credit reporting, credit repair services, or other personal consumer reports Credit reporting Credit monitoring or identity theft protection services Problem canceling credit monitoring or identify theft protection service
0.0:	Credit reporting, credit repair services, or other personal consumer reports Credit reporting Credit monitoring or identity theft protection services Problem canceling credit monitoring or identify theft protection service
0.8374303414739717:	Credit reporting, credit repair services, or other personal consumer reports Credit reporting Credit monitoring or identity theft protection services Didn't receive services that were advertised
1.3236454121859305:	Debt collection I do not know Attempts to collect debt not owed Debt was result of identity theft
1.3236454121859305:	Debt collection I do not know Attempts to collect debt not owed Debt was result of identity theft
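
Because `TfidfTransformer` scales every row to unit length by default, the Euclidean distances above are tied to cosine distance by `euclidean**2 == 2 * cosine_distance`, so both metrics should rank neighbours the same way. The sketch below checks this relation on a made-up corpus; it is an illustration under that assumption, not part of the original analysis.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Made-up mini corpus, for illustration only.
docs = [
    "credit monitoring identity theft protection",
    "credit monitoring services not received",
    "debt collection identity theft",
    "mortgage payment trouble",
]
X = TfidfVectorizer().fit_transform(docs)  # rows are L2-normalised by default

# Query the first document with both metrics.
euclid = NearestNeighbors(n_neighbors=3, metric="euclidean").fit(X)
cosine = NearestNeighbors(n_neighbors=3, metric="cosine").fit(X)
d_euclid, _ = euclid.kneighbors(X[0])
d_cosine, _ = cosine.kneighbors(X[0])

# For unit-length vectors: euclidean**2 == 2 * cosine_distance.
relation_holds = np.allclose(d_euclid ** 2, 2 * d_cosine, atol=1e-6)
print(relation_holds)
```

A practical consequence is that the largest possible Euclidean distance between two tf-idf rows is the square root of 2, reached when the complaints share no terms.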
