
## Text Classification 

### Predicting Category

#### Objectives

On completing the assignment, you will be able to write a simple AI supervised classification application.

#### Description

In the last assignment, you were provided a data set that had 20 columns. That assignment used only a set of five of those columns. In this assignment, you will use a different set of five columns and they are listed below.

Columns to be used in this assignment:
Own_property, Years_employed, Income_type, Family_status, and Target

Besides the above change in data, your code should provide the same functionality as the last assignment.


### Submittal

The uploaded submittal should contain the following:

- ipynb file after running the application from start to finish, containing the marked source code, output, and your interaction.
  
- the corresponding html file.

###### Coding

Follow the steps of the last assignment.



## Keith Yrisarri Stateson
June 16, 2024. Python 3.11.0

## Title: Credit Card Suitability Prediction Using K-Nearest Neighbors (KNN)

This program is designed to train a K-Nearest Neighbors (KNN) classifier to predict the suitability of individuals for credit card issuance based on their demographic and financial data. The program consists of three main parts:

### Training the Classifier
- Loads the training dataset.
- Preprocesses the data (including handling categorical variables and scaling numerical features).
- Splits the data into training and testing sets.
- Trains a KNN classifier on the training set.

### Making Predictions:
- Defines a function to preprocess new data in the same way as the training data.
- Uses the trained classifier to make predictions on the new data.


### Processing a New Dataset:
- Loads a new dataset.
- Preprocesses the new data.
- Makes predictions for each entry in the new dataset.
- Saves the results to a new CSV file.

In [1]:
# import library and read the dataset from the csv file into a pandas dataframe
import pandas as pd
df=pd.read_csv('k_creditcard.csv')

In [3]:
# examine the shape of the dataframe and display the header row and the first 3 rows of data
print (df.shape)
df[0:3]

(9709, 20)


Unnamed: 0,ID,Gender,Own_car,Own_property,Work_phone,Phone,Email,Unemployed,Num_children,Num_family,Account_length,Total_income,Age,Years_employed,Income_type,Education_type,Family_status,Housing_type,Occupation_type,Target
0,5008804,1,1,1,1,0,0,0,0,2,15,427500.0,32.868574,12.435574,Working,Higher education,Civil marriage,Rented apartment,Other,1
1,5008806,1,1,1,0,0,0,0,0,2,29,112500.0,58.793815,3.104787,Working,Secondary / secondary special,Married,House / apartment,Security staff,0
2,5008808,0,0,1,0,1,1,0,0,1,4,270000.0,52.321403,8.353354,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,Sales staff,0


In [4]:
# Create a dataframe with only the columns that are needed for the analysis
df=df.filter(['Own_property', 'Years_employed', 'Income_type', 'Family_status', 'Target'])

In [6]:
# examine the shape of the dataframe and display the header row and the first 3 rows of data
print(df.shape)
df[0:3]

(9709, 5)


Unnamed: 0,Own_property,Years_employed,Income_type,Family_status,Target
0,1,12.435574,Working,Civil marriage,1
1,1,3.104787,Working,Married,0
2,1,8.353354,Commercial associate,Single / not married,0


In [7]:
# examine the unique values of the target column
df.Target.value_counts()

Target
0    8426
1    1283
Name: count, dtype: int64

In [9]:
# extract the Target column (the dependent variable) as a pandas series
y=df.Target
print(type(y))

<class 'pandas.core.series.Series'>


In [10]:
# remove the target column from the dataframe to prepare the dataframe (independent variables) for training a machine learning model
df=df.drop('Target', axis=1)
df[0:3]

Unnamed: 0,Own_property,Years_employed,Income_type,Family_status
0,1,12.435574,Working,Civil marriage
1,1,3.104787,Working,Married
2,1,8.353354,Commercial associate,Single / not married


In [12]:
# create a dataframe of the categorical variables
df_cat = df.filter(['Income_type', 'Family_status'])
df_cat[0:3]

Unnamed: 0,Income_type,Family_status
0,Working,Civil marriage
1,Working,Married
2,Commercial associate,Single / not married


In [13]:
# examine the unique values of the Income_type column
df_cat.Income_type.value_counts()

Income_type
Working                 4960
Commercial associate    2312
Pensioner               1712
State servant            722
Student                    3
Name: count, dtype: int64

In [14]:
# examine the unique values of the Family_status column
df_cat.Family_status.value_counts()

Family_status
Married                 6530
Single / not married    1359
Civil marriage           836
Separated                574
Widow                    410
Name: count, dtype: int64

In [17]:
# convert the categorical variables into dummy variables
df_cat_num = pd.get_dummies(df_cat, dtype=int, drop_first=True)
df_cat_num[0:3]

Unnamed: 0,Income_type_Pensioner,Income_type_State servant,Income_type_Student,Income_type_Working,Family_status_Married,Family_status_Separated,Family_status_Single / not married,Family_status_Widow
0,0,0,0,1,0,0,0,0
1,0,0,0,1,1,0,0,0
2,0,0,0,0,0,0,1,0
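To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up three-row frame (the `toy` data is hypothetical and not taken from `k_creditcard.csv`). The alphabetically first category of each column becomes the implicit baseline and is encoded as all zeros, which is why row 2 above (Commercial associate) has no Income_type indicator set.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Income_type': ['Working', 'Pensioner', 'Commercial associate'],
})

# Without drop_first: one indicator column per category
full = pd.get_dummies(toy, dtype=int)

# With drop_first: the alphabetically first category is dropped
# and is represented by a row of zeros
reduced = pd.get_dummies(toy, dtype=int, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
print(reduced.iloc[2].tolist())  # the 'Commercial associate' row
```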


In [19]:
# create a dataframe of the numerical variables
df_num=df.filter(['Own_property', 'Years_employed'], axis=1)
df_num[0:3]

Unnamed: 0,Own_property,Years_employed
0,1,12.435574
1,1,3.104787
2,1,8.353354


In [21]:
# combine the numerical and dummy variables into a single dataframe
X=pd.concat([df_num, df_cat_num], axis=1)
print(X.shape)
X[0:3]

(9709, 10)


Unnamed: 0,Own_property,Years_employed,Income_type_Pensioner,Income_type_State servant,Income_type_Student,Income_type_Working,Family_status_Married,Family_status_Separated,Family_status_Single / not married,Family_status_Widow
0,1,12.435574,0,0,0,1,0,0,0,0
1,1,3.104787,0,0,0,1,1,0,0,0
2,1,8.353354,0,0,0,0,0,0,1,0


Explanation of the parameter: random_state

Purpose
The random_state parameter acts as a seed for the random number generator used by the train_test_split function. This ensures that the split of data into training and testing sets is the same each time you run the code.

Reproducibility
By setting random_state to a specific integer value (e.g., random_state=1), you ensure that the same rows are selected for the training and testing sets every time you execute the code. This is important for reproducibility, especially when sharing code or comparing results across different runs.

Default Behavior
If you do not include the random_state parameter, the train_test_split function will use a different random seed each time it is run. This means the training and testing sets will vary with each execution, leading to potentially different results.
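The reproducibility point can be checked with a small sketch. The arrays `X_demo` and `y_demo` are made-up data used only for illustration; two calls with the same `random_state` should select the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with a single feature
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.array([0, 1] * 5)

# Two splits with the same seed select the same rows
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
same_split = np.array_equal(a_test, b_test)

# A different seed will usually (not always, on tiny data) select different rows
c_train, c_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)

print(same_split)
print(a_test.ravel(), c_test.ravel())
```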

In [23]:
# split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [27]:
# initialize the Scaler, fit and transform the training data, and transform the test data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

print(X_train_scaled[0:3])
print('\n')
print(X_test_scaled[0:3])

[[-1.43201526 -0.89216039  2.14378668 -0.28592809 -0.01134753 -1.00996337
  -1.42576284 -0.25321606  2.46711954 -0.2096464 ]
 [ 0.69831658  2.49958111 -0.46646432 -0.28592809 -0.01134753  0.99013492
  -1.42576284 -0.25321606 -0.40533099  4.76993643]
 [-1.43201526  0.29689841 -0.46646432  3.49738281 -0.01134753 -1.00996337
   0.70137892 -0.25321606 -0.40533099 -0.2096464 ]]


[[ 0.69831658  1.4655981  -0.46646432 -0.28592809 -0.01134753 -1.00996337
   0.70137892 -0.25321606 -0.40533099 -0.2096464 ]
 [ 0.69831658 -0.48151276 -0.46646432 -0.28592809 -0.01134753  0.99013492
   0.70137892 -0.25321606 -0.40533099 -0.2096464 ]
 [ 0.69831658 -0.2974144  -0.46646432 -0.28592809 -0.01134753  0.99013492
   0.70137892 -0.25321606 -0.40533099 -0.2096464 ]]
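The scaler is fitted on the training data only, and the test data is transformed with the training mean and standard deviation. A minimal sketch with made-up numbers (the `train` and `test` arrays are hypothetical) shows that the test values are not re-centered on their own mean:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel())
print(test_scaled.ravel())  # 2.0 maps to 0 because the training mean is 2.0
```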


In [31]:
# train the selected model (KNN with 5 neighbors) on the scaled training data
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5)

clf.fit (X_train_scaled, y_train)    
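As a rough illustration of what a KNN classifier does at prediction time, the sketch below classifies a point by majority vote among its nearest training points, using Euclidean distance. The helper `knn_predict` and the toy points are hypothetical; scikit-learn's `KNeighborsClassifier` is more elaborate (for example, it can use indexing structures to speed up the search).

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance and keep the labels of the k closest points
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two small clusters
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (2.0, 2.0), (2.1, 1.9)]
labels = [0, 0, 0, 1, 1]

print(knn_predict(points, labels, (0.05, 0.05)))
print(knn_predict(points, labels, (2.0, 2.1)))
```

Because the vote depends on distances, features on larger scales would dominate unless the data is scaled first, which is why the scaling step precedes training.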

In [32]:
# Test the trained model with the scaled test data / make predictions based on the test data
y_pred = clf.predict(X_test_scaled)


In [35]:
# Evaluate the model
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score

print('Accuracy: ', accuracy_score(y_test, y_pred))
print('\n')
print('Classification Report: \n', classification_report(y_test, y_pred))
print('\n')
print('Confusion Matrix: \n', confusion_matrix(y_test, y_pred))
print('\n')
print('ROC AUC: ', roc_auc_score(y_test, y_pred))

Accuracy:  0.850669412976313


Classification Report: 
               precision    recall  f1-score   support

           0       0.86      0.98      0.92      1667
           1       0.32      0.05      0.08       275

    accuracy                           0.85      1942
   macro avg       0.59      0.52      0.50      1942
weighted avg       0.78      0.85      0.80      1942



Confusion Matrix: 
 [[1639   28]
 [ 262   13]]


ROC AUC:  0.5152380433004309


Accuracy
Accuracy: 0.850669412976313
Meaning: 85.07% of the predictions made by your model were correct.
Formula: Total Number of Correct Predictions / Total Number of Predictions
Formula: (TP + TN) / (TP + TN + FP + FN)
True Positives (TP)
False Positives (FP)
False Negatives (FN)
True Negatives (TN)
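The accuracy can be recomputed by hand from the confusion matrix printed above, treating class 1 as the positive class. The sketch also computes a baseline that is not part of the original output: the accuracy of always predicting the majority class (class 0), which is useful context for an imbalanced dataset.

```python
# Counts taken from the confusion matrix printed above
tn, fp = 1639, 28   # actual class 0: predicted 0, predicted 1
fn, tp = 262, 13    # actual class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Baseline: always predict the majority class (class 0)
baseline = (tn + fp) / total

print(round(accuracy, 4))   # 0.8507
print(round(baseline, 4))   # 0.8584
```

Under these counts the model's accuracy is slightly below the majority-class baseline, so accuracy alone says little about how well class 1 is detected.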


 
Classification Report
This report includes precision, recall, f1-score, and support for each class (0 and 1).

Class 0 (No Default)
Precision: 0.86
Of all the individuals predicted to not default (class 0), 86% actually did not default.
Formula: TP / (TP + FP)

Recall: 0.98
Of all the individuals who actually did not default, 98% were correctly predicted as not defaulting.
Formula: TP / (TP + FN)

F1-Score: 0.92
The harmonic mean of precision and recall. A balance between precision and recall.
Formula: F1-Score = (2 × Precision × Recall) / (Precision + Recall)

Support: 1667

Class 1 (Default)
Precision: 0.32
Of all the individuals predicted to default (class 1), 32% actually defaulted.
Formula: TP / (TP + FP)

Recall: 0.05
Of all the individuals who actually defaulted, 5% were correctly predicted as defaulting.
Formula: TP / (TP + FN)

F1-Score: 0.08
The harmonic mean of precision and recall.
Formula: F1-Score = (2 × Precision × Recall) / (Precision + Recall)

Support: 275
The number of actual occurrences of the class in the test set.
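The per-class values in the report can be reproduced from the confusion matrix counts. In this sketch the helper `precision_recall_f1` is a hypothetical name; each class is treated in turn as the positive class.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Class 0 as positive: 1639 correct, 262 wrongly predicted as 0, 28 missed
p0, r0, f0 = precision_recall_f1(tp=1639, fp=262, fn=28)

# Class 1 as positive: 13 correct, 28 wrongly predicted as 1, 262 missed
p1, r1, f1_1 = precision_recall_f1(tp=13, fp=28, fn=262)

print(round(p0, 2), round(r0, 2), round(f0, 2))    # 0.86 0.98 0.92
print(round(p1, 2), round(r1, 2), round(f1_1, 2))  # 0.32 0.05 0.08
```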



Overall Metrics
Accuracy: 0.85
The overall accuracy of the model.

Macro Average:
Macro Avg Precision: 0.59
Average precision across both classes.
Macro Avg Recall: 0.52
Average recall across both classes.
Macro Avg F1-Score: 0.50
Average F1-score across both classes.

Weighted Average:
Weighted Avg Precision: 0.78
Precision weighted by the number of instances in each class.
Weighted Avg Recall: 0.85
Recall weighted by the number of instances in each class.
Weighted Avg F1-Score: 0.80
F1-score weighted by the number of instances in each class.
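The difference between the two averages can be seen by recomputing the precision row. This is a minimal sketch using the counts from the confusion matrix; the dictionary names are chosen here for illustration.

```python
# Per-class precision from the confusion matrix counts, and per-class support
precision = {0: 1639 / (1639 + 262), 1: 13 / (13 + 28)}
support = {0: 1667, 1: 275}

# Macro average: every class counts equally
macro = sum(precision.values()) / len(precision)

# Weighted average: each class counts in proportion to its support
weighted = sum(precision[c] * support[c] for c in precision) / sum(support.values())

print(round(macro, 2))     # 0.59
print(round(weighted, 2))  # 0.78
```

The weighted average is pulled toward class 0 because that class has about six times the support of class 1.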



Confusion Matrix
[[1639   28]
 [ 262   13]]

Rows are the actual classes and columns are the predicted classes.

True Positives (TP) for Class 0: 1639
Correctly predicted as not defaulting.
False Positives (FP) for Class 1: 28
Incorrectly predicted as defaulting, but actually did not default.
False Negatives (FN) for Class 1: 262
Incorrectly predicted as not defaulting, but actually defaulted.
True Positives (TP) for Class 1: 13
Correctly predicted as defaulting.
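scikit-learn's `confusion_matrix` places actual classes in rows and predicted classes in columns, so for binary labels `ravel()` yields the counts in the order TN, FP, FN, TP (with class 1 as the positive class). A small sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only
y_true_demo = [0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 1, 0, 1]

cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```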



ROC AUC
ROC AUC: 0.5152380433004309

ROC AUC (Receiver Operating Characteristic Area Under Curve): Measures the model's ability to distinguish between the classes.
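Range: 0 to 1, where 0.5 corresponds to random guessing and 1 to a perfect classifier.
Interpretation: A ROC AUC of 0.515 indicates that the model is only marginally better than random guessing. Because it was computed from the hard class predictions (y_pred) rather than from predicted probabilities, it reflects a single decision threshold.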
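When `roc_auc_score` receives hard 0/1 predictions, the ROC curve has a single interior point and the area reduces to the mean of the two per-class recalls. The sketch below rebuilds label arrays that reproduce the confusion matrix (the row order is arbitrary and assumed here) and checks that identity.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Rebuild label arrays that reproduce the confusion matrix [[1639, 28], [262, 13]]
y_true = np.array([0] * 1667 + [1] * 275)
y_hard = np.array([0] * 1639 + [1] * 28 + [0] * 262 + [1] * 13)

auc_hard = roc_auc_score(y_true, y_hard)

# Mean of the class 0 recall and the class 1 recall
mean_recall = (1639 / 1667 + 13 / 275) / 2

print(auc_hard, mean_recall)
```

Passing scores such as `clf.predict_proba(X_test_scaled)[:, 1]` to `roc_auc_score` would instead evaluate the ranking across all thresholds; that value is not reported in this notebook.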

In [61]:
# Make a credit card suitability prediction for the provided input

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

def predict_credit_card_suitability(own_property, years_employed, income_type, family_status):
    # Load the dataset
    df = pd.read_csv('k_creditcard.csv')
    df = df.filter(['Own_property', 'Years_employed', 'Income_type', 'Family_status', 'Target'])

    y=df['Target']
    df = df.drop('Target', axis=1)

    df_cat = df.filter(['Income_type','Family_status'])
    df_cat_num = pd.get_dummies(df_cat, dtype=int, drop_first=True)    
    df_num = df.filter(['Own_property','Years_employed'],axis=1)
    X = pd.concat([df_num, df_cat_num], axis=1)

    # split the df into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    # initialize the Scaler, fit and transform the training data, and transform the test data
    sc= StandardScaler()
    X_train_scaled = sc.fit_transform(X_train)
    X_test_scaled = sc.transform(X_test)
    
    # train the KNeighborsClassifier model
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit (X_train_scaled, y_train)
    
    # skip the evaluation of the model here; it was evaluated in the cells above
    
    # Make a prediction for the provided input
    input_data = pd.DataFrame([[own_property, years_employed, income_type, family_status]], 
                              columns=['Own_property', 'Years_employed', 'Income_type', 'Family_status'])
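    input_data_cat = input_data.filter(['Income_type','Family_status'])
    # Encode without drop_first: with a single row, drop_first=True would drop the only
    # category present, so every input would be encoded as the baseline category.
    # The alignment step below keeps only the training columns.
    input_data_cat_num = pd.get_dummies(input_data_cat, dtype=int)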
    input_data_num = input_data.filter(['Own_property', 'Years_employed'], axis=1)
    input_data_final = pd.concat([input_data_num, input_data_cat_num], axis=1)
    
    # Align input_data_final with the training data's columns because of the categorical columns
    missing_cols = set(X.columns) - set(input_data_final.columns)
    for col in missing_cols:
        input_data_final[col] = 0
    input_data_final = input_data_final[X.columns]
    
    # Scale input data using the same scaler fitted on training data
    input_data_scaled = sc.transform(input_data_final)

    # Make a prediction for the provided input
    prediction = clf.predict(input_data_scaled)
    
    return prediction[0]
    # return 'Suitable' if prediction[0] == 1 else 'Not Suitable'
    # return f"{'Suitable' if prediction[0] == 1 else 'Not Suitable'} ({prediction[0]})"
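One-hot encoding a single row needs care: with `drop_first=True`, `pd.get_dummies` drops the only category it sees, so once the missing columns are filled with 0 the row is encoded as the baseline category regardless of the input. A minimal sketch of the effect and of a reindex-based alignment follows; `train_columns` lists the Income_type dummy columns shown earlier in this notebook, and the variable names are chosen here for illustration.

```python
import pandas as pd

# Income_type dummy columns produced when encoding the training data
train_columns = ['Income_type_Pensioner', 'Income_type_State servant',
                 'Income_type_Student', 'Income_type_Working']

single = pd.DataFrame({'Income_type': ['Working']})

# drop_first=True drops the only category present, leaving no dummy columns
dropped = pd.get_dummies(single, dtype=int, drop_first=True)

# Encoding without drop_first and reindexing keeps the 'Working' indicator
aligned = (pd.get_dummies(single, dtype=int)
             .reindex(columns=train_columns, fill_value=0))

print(dropped.shape[1])
print(aligned.iloc[0].tolist())
```

An unseen value (such as a misspelled category) produces a column that is not in `train_columns`, so the reindex discards it and the row falls back to the baseline encoding without raising an error.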


In [63]:
# Examples of making a credit card suitability prediction given the input
print(predict_credit_card_suitability(1, 5, 'Working', 'Married'))
print(predict_credit_card_suitability(0, 10, 'Pensioner', 'Single / not married'))
print(predict_credit_card_suitability(1, 2, 'Commercial associate', 'Civil marriage'))
print(predict_credit_card_suitability(0, 15, 'State servant', 'Widow'))
print(predict_credit_card_suitability(1, 8, 'Student', 'Separated'))
print(predict_credit_card_suitability(0, 22.91354374148682, 'Working', 'Maried'))

1
0
0
0
0
0


In [58]:
# Make credit card suitability predictions for a new dataset and save the results to a new CSV

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

def train_knn_classifier():
    # Load the dataset
    df = pd.read_csv('k_creditcard.csv')
    df = df.filter(['Own_property', 'Years_employed', 'Income_type', 'Family_status', 'Target'])

    y = df['Target']
    df = df.drop('Target', axis=1)

    df_cat = df.filter(['Income_type','Family_status'])
    df_cat_num = pd.get_dummies(df_cat, dtype=int, drop_first=True)    
    df_num = df.filter(['Own_property','Years_employed'],axis=1)
    X = pd.concat([df_num, df_cat_num], axis=1)

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

    # Initialize the Scaler, fit and transform the training data, and transform the test data
    sc = StandardScaler()
    X_train_scaled = sc.fit_transform(X_train)
    X_test_scaled = sc.transform(X_test)

    # Train the KNeighborsClassifier model
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X_train_scaled, y_train)

    return clf, sc, X.columns

def predict_credit_card_suitability_from_df(new_df, clf, sc, columns):
    # Preprocess the new data
    df_cat = new_df.filter(['Income_type','Family_status'])
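    # Encode without drop_first: the column alignment below removes the baseline columns,
    # which stays correct even if the new data lacks some of the training categories
    df_cat_num = pd.get_dummies(df_cat, dtype=int)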
    df_num = new_df.filter(['Own_property','Years_employed'],axis=1)
    input_data_final = pd.concat([df_num, df_cat_num], axis=1)

    # Align input_data_final with the training data's columns because of the categorical columns
    missing_cols = set(columns) - set(input_data_final.columns)
    for col in missing_cols:
        input_data_final[col] = 0
    input_data_final = input_data_final[columns]
    
    # Scale input data using the same scaler fitted on training data
    input_data_scaled = sc.transform(input_data_final)
    
    # Make predictions for the provided input
    predictions = clf.predict(input_data_scaled)
    
    return predictions

# Train the model and scaler
clf, sc, columns = train_knn_classifier()

# Load the new dataset
new_df = pd.read_csv('new_creditcard_applications.csv')
# new_data_path = 'path_to_new_data.csv'
# new_df_with_predictions = pd.read_csv(new_data_path)
new_df_with_predictions = new_df.filter(['Own_property', 'Years_employed', 'Income_type', 'Family_status'])

# Make predictions and add them to the new dataframe as a new column
new_df_with_predictions['Suitability'] = predict_credit_card_suitability_from_df(new_df_with_predictions, clf, sc, columns)

# Display the first few rows of the updated dataframe
print(new_df_with_predictions.head())

# Save the new dataframe with the predictions to a new CSV file
new_df_with_predictions.to_csv('predicted_suitability.csv', index=False)


   Own_property  Years_employed           Income_type         Family_status  \
0             1        2.160209  Commercial associate  Single / not married   
1             1        2.746121               Working        Civil marriage   
2             1        0.000000             Pensioner               Married   
3             1        0.495561               Working  Single / not married   
4             1        5.076080               Working               Married   

   Suitability  
0            0  
1            0  
2            0  
3            0  
4            0  
