## Text Classification 

### Predicting Category

#### Objectives

On completing the assignment, you will learn how to write a simple AI supervised classification application.

#### Description

Write an AI application which, when provided with a person's demographic, financial, and personal attributes such as age, salary, years employed etc., will predict a person's suitability for credit card issuance. For training and testing the application, please use the labeled dataset provided in the file, k_creditcard.csv. The dataset contains data regarding 9709 individuals. Each person is labeled as either suitable (assigned Target value of 1) or unsuitable (assigned Target value of 0). Use 80% of the data items for training, and the remaining 20% for testing. Train the sklearn's KNeighborsClassifier classifier with parameter n_neighbors set to 5. After the classifier is trained, test it using the test data and produce accuracy score, classification report, and confusion matrix. Then, optionally try out a few self-created individual values and note the application response. 

The dataset provided had 20 columns. However, I used only five of those columns and they are listed below.

Data columns used in this assignment:
Total_income, Age, Education_type, Housing_type, and Target


#### Implementation Notes


#### Dataset source

The dataset was downloaded from the Kaggle website below.

https://www.kaggle.com/datasets/rohit265/credit-card-eligibility-data-determining-factors


### Submittal

The uploaded submittal should contain the following:

- jpynb file after running the application from start to finish containing the marked source code, output, and your interaction.
  
- the corresponding html file.

#### Coding

Follow the steps below.


## Keith Yrisarri Stateson
June 16, 2024. Python 3.11.0

## Title: Credit Card Suitability Prediction Using K-Nearest Neighbors (KNN)

This program is designed to train a K-Nearest Neighbors (KNN) classifier to predict the suitability of individuals for credit card issuance based on their demographic and financial data. The program consists of three main parts:

### Training the Classifier
- Loads the training dataset.
- Preprocesses the data (including handling categorical variables and scaling numerical features).
- Splits the data into training and testing sets.
- Trains a KNN classifier on the training set.

### Making Predictions:
- Defines a function to preprocess new data in the same way as the training data.
- Uses the trained classifier to make predictions on the new data.


### Processing a New Dataset:
- Loads a new dataset.
- Preprocesses the new data.
- Makes predictions for each entry in the new dataset.
- Saves the results to a new CSV file.

In [25]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [26]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


Read the dataset from the csv file into a pandas dataframe

In [27]:
import pandas as pd
df=pd.read_csv('k_creditcard.csv')

Find dimensions of the dataframe (rows,columns) and display its fist few rows.

In [28]:
print (df.shape)
df[0:3]

(9709, 20)


Unnamed: 0,ID,Gender,Own_car,Own_property,Work_phone,Phone,Email,Unemployed,Num_children,Num_family,Account_length,Total_income,Age,Years_employed,Income_type,Education_type,Family_status,Housing_type,Occupation_type,Target
0,5008804,1,1,1,1,0,0,0,0,2,15,427500.0,32.868574,12.435574,Working,Higher education,Civil marriage,Rented apartment,Other,1
1,5008806,1,1,1,0,0,0,0,0,2,29,112500.0,58.793815,3.104787,Working,Secondary / secondary special,Married,House / apartment,Security staff,0
2,5008808,0,0,1,0,1,1,0,0,1,4,270000.0,52.321403,8.353354,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,Sales staff,0


Display value count for each category for column Target.

Filter only the needed columns

In [29]:
df=df.filter(['Total_income','Age','Education_type','Housing_type','Target'])

Find dimensions of the dataframe (rows,columns) and display its fist few rows.

In [30]:
print (df.shape)
df[0:3]

(9709, 5)


Unnamed: 0,Total_income,Age,Education_type,Housing_type,Target
0,427500.0,32.868574,Higher education,Rented apartment,1
1,112500.0,58.793815,Secondary / secondary special,House / apartment,0
2,270000.0,52.321403,Secondary / secondary special,House / apartment,0


Display the categories in Target column
0: Indicates that the individual did not default on their credit card payment
1: Indicates that the individual did default on their credit card payment

The Target column in the dataset represents the outcome or the dependent variable that the model aims to predict.

Model Objective
The objective of building a model with this dataset is to predict the value of the Target column based on the other features (e.g., total income, age, education type, and housing type). This involves training a machine learning model to learn the patterns and relationships between the input features and the Target variable.

Application in Credit Risk Analysis
Predicting the Target variable is crucial for financial institutions as it helps in assessing credit risk. By identifying individuals who are likely to default, institutions can make informed decisions about issuing credit cards, setting credit limits, and implementing risk mitigation strategies.

In [31]:
df.Target.value_counts()

Target
0    8426
1    1283
Name: count, dtype: int64

Assign the Target column to variable y to be used as labels for the features
variable y is a pandas series

Series vs. DataFrame
Series: A one-dimensional array-like structure that holds a single column of data.
DataFrame: A two-dimensional, tabular data structure with labeled axes (rows and columns).

In [32]:
y=df.Target
print (type(y))

<class 'pandas.core.series.Series'>


Drop Target column among the features
Droppping separates the features (independent variables) from the target (dependent variable) in preparation for training a machine learning model.

In [33]:
df=df.drop(['Target'],axis=1)
df[0:3]

Unnamed: 0,Total_income,Age,Education_type,Housing_type
0,427500.0,32.868574,Higher education,Rented apartment
1,112500.0,58.793815,Secondary / secondary special,House / apartment
2,270000.0,52.321403,Secondary / secondary special,House / apartment


Filter out the categorical columns (features) into a new dataframe for converting them to numerical columns (features)

Purpose of Filtering Categorical Columns

1. Separate Processing:
Categorical columns require different preprocessing steps compared to numerical columns. By filtering out these columns, you can handle them separately and apply the appropriate preprocessing techniques.

2. One-Hot Encoding:
Machine learning algorithms typically require numerical input. Categorical data needs to be converted into a numerical format.
One common technique for this conversion is one-hot encoding, which creates binary (0 or 1) columns for each unique category value. This allows the model to interpret categorical data in a meaningful way without assuming any ordinal relationship.

3. Feature Engineering:
By separating and encoding categorical columns, you create new features that represent the presence or absence of specific categories. This can enhance the model’s ability to capture patterns and relationships in the data.

In [34]:
df_cat=df.filter(['Education_type','Housing_type'])
df_cat [0:3]

Unnamed: 0,Education_type,Housing_type
0,Higher education,Rented apartment
1,Secondary / secondary special,House / apartment
2,Secondary / secondary special,House / apartment


In [35]:
print("Number of categories in 'Education_type':", len(df_cat.Education_type.value_counts()))

Number of categories in 'Education_type': 5


Display the categories of Education_type

In [36]:
df_cat.Education_type.value_counts()
#df_cat.groupby('Education_type').count() #alternative form

Education_type
Secondary / secondary special    6761
Higher education                 2457
Incomplete higher                 371
Lower secondary                   114
Academic degree                     6
Name: count, dtype: int64

Display the categories of Housing_type

In [37]:
df_cat.Housing_type.value_counts()

Housing_type
House / apartment      8684
With parents            448
Municipal apartment     323
Rented apartment        144
Office apartment         76
Co-op apartment          34
Name: count, dtype: int64

In [38]:
print("Number of categories in 'Housing_type':", len(df_cat.Housing_type.value_counts()))

Number of categories in 'Housing_type': 6


By default, get_dummies creates k new numerical columns for each categorical column where k is the number of categories used in the categorical column with each new column representing a category. However, if parameter drop_first=True is provided as done below, it creates one less new column. For example, for Education_type, it will create 4 new columns instead of 5 (for its 5 categories). In that cases, the absence value (value 0) for the four new columns, would imply the presence of fifth.

By default, the allowable values for each new column are True/False. However, if parameter dtype=int is provided as done below, the allowable column values become 0/1.

Explanation of the get_dummies function:  pd.get_dummies

The pd.get_dummies function is provided by the pandas library. It is used to convert categorical variables into a series of binary (0 or 1) columns, which is known as one-hot encoding.

df_cat is the DataFrame containing the categorical variables you want to encode. In your case, this DataFrame includes columns like Education_type and Housing_type.

dtype=int argument specifies the data type of the resulting dummy variables. By setting dtype=int, you ensure that the binary columns created by get_dummies will contain integers (0s and 1s) instead of the default data type (which might be float).

drop_first=True argument tells get_dummies to drop the first category of each categorical variable. This is done to avoid the dummy variable trap, which occurs when the dummy variables are perfectly collinear. By dropping the first category, you remove this redundancy. For example, if you have a categorical variable with three categories, A, B, and C, get_dummies will create three columns (A, B, C). With drop_first=True, only two columns will be created (B, C), and the presence of A will be implied when both B and C are 0.

In [39]:
df_cat_num= pd.get_dummies(df_cat, dtype=int,drop_first=True)
df_cat_num [0:3]

Unnamed: 0,Education_type_Higher education,Education_type_Incomplete higher,Education_type_Lower secondary,Education_type_Secondary / secondary special,Housing_type_House / apartment,Housing_type_Municipal apartment,Housing_type_Office apartment,Housing_type_Rented apartment,Housing_type_With parents
0,1,0,0,0,0,0,0,1,0
1,0,0,0,1,1,0,0,0,0
2,0,0,0,1,1,0,0,0,0


Separate out the numerical columns into a new dataframe

In [40]:
df_num = df.filter(['Total_income','Age'],axis=1)
df_num [0:3]

Unnamed: 0,Total_income,Age
0,427500.0,32.868574
1,112500.0,58.793815
2,270000.0,52.321403


Concatenate the numerical columns and the categorical columns converted into numerical columns. So, now all the columns are numerical and represent our features. 

In [41]:
X=pd.concat([df_num, df_cat_num],axis=1)
print (X.shape)
X [0:3]

(9709, 11)


Unnamed: 0,Total_income,Age,Education_type_Higher education,Education_type_Incomplete higher,Education_type_Lower secondary,Education_type_Secondary / secondary special,Housing_type_House / apartment,Housing_type_Municipal apartment,Housing_type_Office apartment,Housing_type_Rented apartment,Housing_type_With parents
0,427500.0,32.868574,1,0,0,0,0,0,0,1,0
1,112500.0,58.793815,0,0,0,1,1,0,0,0,0
2,270000.0,52.321403,0,0,0,1,1,0,0,0,0


Split all data (feature and label) into a training part and a testing part.

Function: train_test_split
Is a utility function to split arrays or matrices into random train and test subsets.

Function like of code: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
This line splits the dataset into four subsets:
- training features (X_train)
- testing features (X_test)
- training labels (y_train)
- testing labels (y_test)

Parameter Explanations
X: The input features (independent variables) of the dataset. This is typically a DataFrame or a 2D array.

y: The target variable (dependent variable) of the dataset. This is typically a Series or a 1D array.

test_size=0.2
This specifies the proportion of the dataset to include in the test split.
Value: 0.2 means 20% of the data will be used for testing, and the remaining 80% will be used for training.
Alternative: You can also specify train_size instead of test_size. For example, train_size=0.8 would have the same effect.

Explanation of the parameter: random_state

Purpose
The random_state parameter acts as a seed for the random number generator used by the train_test_split function. This ensures that the split of data into training and testing sets is the same each time you run the code.

Reproducibility
By setting random_state to a specific integer value (e.g., random_state=1), you ensure that the same rows are selected for the training and testing sets every time you execute the code. This is important for reproducibility, especially when sharing code or comparing results across different runs.

Default Behavior
If you do not include the random_state parameter, the train_test_split function will use a different random seed each time it is run. This means the training and testing sets will vary with each execution, leading to potentially different results.

In [42]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# randon_state=1 ensures that the same rows are selected for the training and testing sets every time you execute the code, and is useful for debugging purposes.
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)


Scale both training and testing data

Scaling the data is a preprocessing step in which the features of the dataset are transformed to have a common scale. This is often done to ensure that no single feature dominates the learning process due to its scale. It is especially important for algorithms that rely on distance measurements (e.g., K-Nearest Neighbors, Support Vector Machines) or those that assume normality (e.g., Principal Component Analysis).

Why Scale the Data?
- Improves Convergence Speed: Algorithms that use gradient descent for optimization (e.g., neural networks) converge faster with scaled data.
- Avoids Dominance by Large-Scale Features: Features with larger scales can dominate the objective function, leading to models that poorly generalize.
- Distance-Based Algorithms: Algorithms like K-Nearest Neighbors and K-Means Clustering are sensitive to the scale of the data because they rely on distance calculations.

Common Scaling Techniques
- Standardization (Z-score normalization): This scales the data to have a mean of 0 and a standard deviation of 1.
- Min-Max Scaling: This scales the data to a fixed range, usually [0, 1].
- Robust Scaling: This scales the data according to the interquartile range, making it robust to outliers.

Module: StandardScaler
Is a standardization tool that scales the data to have a mean of 0 and a standard deviation of 1.

Initializing the Scaler
Class insance: sc = StandardScaler()
This creates an instance of the StandardScaler class. The instance sc will be used to fit the scaler on the training data and transform both the training and testing data.

Fitting and Transforming the Training Data
X_train_scaled = sc.fit_transform(X_train)
".fit_transform" method standardizes the training data based on the statistics (mean and standard deviation) of the training set. This method combines two steps:
1. Fit: It computes the mean and standard deviation of X_train.
2. Transform: It uses these computed values to standardize X_train, scaling each feature to have a mean of 0 and a standard deviation of 1.

Transforming the Testing Data
?????
Current code may be wrong: X_test_scaled = sc.fit_transform(X_test)
Should be: X_test_scaled = sc.transform(X_test)

In [43]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Fit the scaler on the training data and transform it
X_train_scaled = sc.fit_transform (X_train)
# Use the same scaler to transform the test data
X_test_scaled = sc.transform (X_test)

print (X_train_scaled[0:3])
print('\n')
print (X_test_scaled[0:3])

[[-0.23851238  0.76836105 -0.58354493 -0.19834725 -0.10643366  0.6606597
   0.34168778 -0.1838697  -0.08823339 -0.12259181 -0.21884996]
 [-0.46195437 -0.86702629 -0.58354493 -0.19834725 -0.10643366  0.6606597
   0.34168778 -0.1838697  -0.08823339 -0.12259181 -0.21884996]
 [ 0.43181358 -0.8727113   1.71366412 -0.19834725 -0.10643366 -1.51363856
   0.34168778 -0.1838697  -0.08823339 -0.12259181 -0.21884996]]


[[-0.01507039  1.14451909 -0.58354493 -0.19834725 -0.10643366  0.6606597
   0.34168778 -0.1838697  -0.08823339 -0.12259181 -0.21884996]
 [-0.01507039 -0.49702701 -0.58354493 -0.19834725 -0.10643366  0.6606597
   0.34168778 -0.1838697  -0.08823339 -0.12259181 -0.21884996]
 [-0.68539636  0.64186962 -0.58354493 -0.19834725 -0.10643366  0.6606597
   0.34168778 -0.1838697  -0.08823339 -0.12259181 -0.21884996]]


Train our select model with scaled training data

Models:
    K Neighbors Classifier model
    Support Vector Classifier (SVC) model
    Random Forest Classifier model
    Logistic Regression model

Given this dataset for credit card risk analysis, the best model would be Random Forest Classifier, then SVC, then K Neighbors Classifer, and lastly Logistic Regression.

1. K-Neighbors Classifier (KNN)
Pros:
Simple and intuitive.
No assumptions about data distribution.
Good for smaller datasets.
Cons:
Can be computationally expensive with large datasets.
Sensitive to irrelevant features and feature scaling.
Performance can degrade with high-dimensional data.

2. Support Vector Classifier (SVC)
Pros:
Effective in high-dimensional spaces.
Works well for both linear and non-linear classification with the use of different kernels.
Robust to overfitting, especially in high-dimensional space.
Cons:
Can be memory-intensive and slow with large datasets.
Requires careful tuning of parameters (e.g., kernel type, regularization parameter).
Not inherently suitable for probability estimation.

3. Random Forest Classifier
Pros:
Handles large datasets well.
Reduces overfitting due to ensemble of trees.
Provides feature importance, helping in understanding which features are most influential.
Works well with both numerical and categorical data.
Cons:
Can be computationally intensive with large number of trees.
Interpretability can be an issue compared to simpler models.

4. Logistic Regression
Pros:
Simple and fast.
Provides probability estimates, which can be useful for risk scoring.
Easy to interpret and implement.
Works well with linearly separable data.
Cons:
Assumes linear relationship between features and the log-odds of the target.
Can struggle with non-linear relationships unless feature engineering is applied.
Performance can degrade with highly imbalanced data without proper handling (e.g., class weights, resampling techniques).

In [44]:
# K Neighbors Classifier model
from sklearn.neighbors  import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors= 5)

# Support Vector Classifier (SVC) model
#from sklearn.svm import SVC
#clf = SVC()

# Random Forest Classifier model
#from sklearn.ensemble import RandomForestClassifier
#clf = RandomForestClassifier(n_estimators=500)

# Logistic Regression model
#from sklearn.linear_model import LogisticRegression
#clf = LogisticRegression ()

clf.fit (X_train_scaled,y_train)

Test trained model with scaled testing data / Making predictions

In [45]:
y_pred = clf.predict(X_test_scaled)

Produce accuracy score, classification report, and confusion matrix
Evaluating the model

Metrics

Accuracy Score
This metric is the ratio of correctly predicted instances to the total instances.
It is less sensitive to changes in the minority class if the majority class dominates.

Classification Report
Includes precision, recall, and F1-score for each class.
These metrics can show minor changes if the predictions for the minority class change, even if the accuracy remains the same.

Confusion Matrix
Provides a detailed breakdown of true positives, true negatives, false positives, and false negatives.
Even small changes in predictions can lead to visible changes in the confusion matrix.

In [46]:
from sklearn.metrics import accuracy_score, classification_report,confusion_matrix

print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))
print (confusion_matrix (y_test,y_pred))

0.8583934088568486
              precision    recall  f1-score   support

           0       0.87      0.98      0.92      1693
           1       0.16      0.02      0.04       249

    accuracy                           0.86      1942
   macro avg       0.52      0.50      0.48      1942
weighted avg       0.78      0.86      0.81      1942

[[1661   32]
 [ 243    6]]


Explanations

Accuracy Score
87% of the predictions made by your model were correct.
Accuracy = Total Number of Predictions / Number of Correct Predictions
​Accuracy = TP+TN / TP+TN+FP+FN


​
Classification Report
0 represents individuals who did not default on their credit card payment

Class 0 (No Default)
Precision: 0.87 - Of all the individuals predicted to not default (0), 87% actually did not default.
Recall: 0.98 - Of all the individuals who actually did not default, 99% were correctly predicted as not defaulting.
F1-Score: 0.92 - The balance between precision and recall for individuals who did not default is high.
Support: 1693 - There are 1693 individuals who did not default in the dataset.

Class 1 (Default)
Precision: 0.16 - Of all the individuals predicted to default (1), only 21% actually defaulted.
Recall: 0.02 - Of all the individuals who actually defaulted, only 2% were correctly predicted as defaulting.
F1-Score: 0.04 - The balance between precision and recall for individuals who defaulted is very low.

Overall Metrics
Accuracy: 0.86 - The model correctly predicts 86% of the instances in the dataset.

Macro Average:
Precision: 0.52 - Average precision across both classes.
Recall: 0.50 - Average recall across both classes.
F1-Score: 0.48 - Average F1-score across both classes.

Weighted Average:
Precision: 0.78 - Precision weighted by the number of instances in each class.
Recall: 0.86 - Recall weighted by the number of instances in each class.
F1-Score: 0.81 - F1-score weighted by the number of instances in each class.



Confusion Matrix
Confusion Matrix: [[1661  32]
                   [ 243   6]]
True Positives (TP): 1663
False Positives (FP): 19
False Negatives (FN): 255
True Negatives (TN): 5

TP (True Positives): Instances correctly predicted as the positive class.
TN (True Negatives): Instances correctly predicted as the negative class.
FP (False Positives): Instances incorrectly predicted as the positive class.
FN (False Negatives): Instances incorrectly predicted as the negative class.

In [113]:
# Make a credit card suitability prediction for the provided input

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
# from sklearn.preprocessing import LabelEncoder
# from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def predict_credit_card_suitability(total_income, age, education_type, housing_type):
    # Load the dataset
    df = pd.read_csv('k_creditcard.csv')
    df = df.filter(['Total_income','Age','Education_type','Housing_type','Target'])

    y = df.Target # or can write as: y=df['Target']
    df = df.drop('Target', axis=1)

    df_cat = df.filter(['Education_type','Housing_type'])
    df_cat_num = pd.get_dummies(df_cat, dtype=int, drop_first=True)    
    df_num = df.filter(['Total_income','Age'],axis=1)
    X = pd.concat([df_num, df_cat_num], axis=1)
    
    # or can preprocess the dataset and define features(X) as follows:
    # label_encoders = {}
    # for column in ['Education_type', 'Housing_type']:
    #     label_encoders[column] = LabelEncoder()
    #     df[column] = label_encoders[column].fit_transform(df[column])
    # X = df[['Total_income', 'Age', 'Education_type', 'Housing_type']]
    
    # split the df into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    
    # intiliaze the Scaler, fit and transform the training data, and transform the test data
    sc= StandardScaler()
    X_train_scaled = sc.fit_transform(X_train)
    X_test_scaled = sc.transform(X_test)
    
    # train the KNeighborsClassifier model
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit (X_train_scaled, y_train)
    
    # skip the evauation of the model as the accuray is high
    
    # Make a prediction for the provided input
    input_data = pd.DataFrame([[total_income, age, education_type, housing_type]], 
                              columns=['Total_income', 'Age', 'Education_type', 'Housing_type'])
    input_data_cat = input_data.filter(['Education_type', 'Housing_type'])
    input_data_cat_num = pd.get_dummies(input_data_cat, dtype=int, drop_first=True)
    input_data_num = input_data.filter(['Total_income', 'Age'], axis=1)
    input_data_final = pd.concat([input_data_num, input_data_cat_num], axis=1)
    
    # Align input_data_final with the training data's columns because of the categorical columns
    missing_cols = set(X.columns) - set(input_data_final.columns)
    for col in missing_cols:
        input_data_final[col] = 0
    input_data_final = input_data_final[X.columns]
    
    # Scale input data using the same scaler fitted on training data
    input_data_scaled = sc.transform(input_data_final)

    # Make a prediction for the provided input
    prediction = clf.predict(input_data_scaled)
    
    # Make a prediction for the provided input - if using label encoders
    # input_data = pd.DataFrame([[Total_income, Age, label_encoders['Education_type'].transform([Education_type])[0], label_encoders['Housing_type'].transform([Housing_type])[0]]],
    #                           columns=['Total_income', 'Age', 'Education_type', 'Housing_type'])
    # prediction = clf.predict(input_data)
    
    # return prediction[0]
    # return 'Suitable' if prediction[0] == 1 else 'Not Suitable'
    return f"{'Suitable' if prediction[0] == 1 else 'Not Suitable'} ({prediction[0]})"

In [114]:
# Examples of making a credit card suitability prediction given the input
print(predict_credit_card_suitability(50000, 30, 'Higher education', 'House / apartment'))
print(predict_credit_card_suitability(120000, 45, 'Secondary / secondary special', 'House / apartment'))
print(predict_credit_card_suitability(750000000, 50, 'Higher education', 'Rented apartment'))
print(predict_credit_card_suitability(350000000, 25, 'Secondary / secondary special', 'House / apartment'))
print('\n')
print(predict_credit_card_suitability(50000, 30, 'Higher education', 'Rented'))
print(predict_credit_card_suitability(120000, 45, 'Secondary / secondary special', 'House / apartment'))
print(predict_credit_card_suitability(75000, 50, 'Higher education', 'Rented'))
print(predict_credit_card_suitability(35000, 25, 'Secondary / secondary special', 'House / apartment'))
print('\n')
print(predict_credit_card_suitability(370000, 30, 'Higher education', 'House / apartment'))
print(predict_credit_card_suitability(120000000, 45, 'Secondary / secondary special', 'House / apartment'))
print(predict_credit_card_suitability(6595000, 50, 'Higher education', 'House / apartment'))
print(predict_credit_card_suitability(3500000000, 25, 'Secondary / secondary special', 'House / apartment'))


Not Suitable (0)
Not Suitable (0)
Suitable (1)
Suitable (1)


Not Suitable (0)
Not Suitable (0)
Not Suitable (0)
Not Suitable (0)


Not Suitable (0)
Suitable (1)
Suitable (1)
Suitable (1)


In [102]:
# Make credit card suitability predictions for a new dataset and save the results to a new CSV

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

def train_knn_classifier():
    # Load the dataset
    df = pd.read_csv('k_creditcard.csv')
    df = df.filter(['Total_income', 'Age', 'Education_type', 'Housing_type', 'Target'])

    y = df['Target']
    df = df.drop('Target', axis=1)

    df_cat = df.filter(['Education_type', 'Housing_type'])
    df_cat_num = pd.get_dummies(df_cat, dtype=int, drop_first=True)    
    df_num = df.filter(['Total_income', 'Age'], axis=1)
    X = pd.concat([df_num, df_cat_num], axis=1)

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

    # Initialize the Scaler, fit and transform the training data, and transform the test data
    sc = StandardScaler()
    X_train_scaled = sc.fit_transform(X_train)
    X_test_scaled = sc.transform(X_test)

    # Train the KNeighborsClassifier model
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X_train_scaled, y_train)

    return clf, sc, X.columns

def predict_credit_card_suitability_from_df(new_df, clf, sc, columns):
    # Preprocess the new data
    df_cat = new_df.filter(['Education_type', 'Housing_type'])
    df_cat_num = pd.get_dummies(df_cat, dtype=int, drop_first=True)
    df_num = new_df.filter(['Total_income', 'Age'], axis=1)
    input_data_final = pd.concat([df_num, df_cat_num], axis=1)

    # Align input_data_final with the training data's columns because of the categorical columns
    missing_cols = set(columns) - set(input_data_final.columns)
    for col in missing_cols:
        input_data_final[col] = 0
    input_data_final = input_data_final[columns]
    
    # Scale input data using the same scaler fitted on training data
    input_data_scaled = sc.transform(input_data_final)
    
    # Make predictions for the provided input
    predictions = clf.predict(input_data_scaled)
    
    return predictions

# Train the model and scaler
clf, sc, columns = train_knn_classifier()

# Load the new dataset
new_df = pd.read_csv('new_creditcard_applications.csv')
# new_data_path = 'path_to_new_data.csv'
# new_df_with_predictions = pd.read_csv(new_data_path)
new_df_with_predictions = new_df.filter(['Total_income', 'Age', 'Education_type', 'Housing_type'])

# Make predictions and add them to the new dataframe as a new column
new_df_with_predictions['Suitability'] = predict_credit_card_suitability_from_df(new_df_with_predictions, clf, sc, columns)

# Display the first few rows of the updated dataframe
print(new_df_with_predictions.head())

# Save the new dataframe with the predictions to a new CSV file
new_df_with_predictions.to_csv('predicted_suitability.csv', index=False)


   Total_income        Age                 Education_type       Housing_type  \
0      225000.0  22.847834  Secondary / secondary special  House / apartment   
1      135000.0  43.488915  Secondary / secondary special  House / apartment   
2      243000.0  62.155965  Secondary / secondary special  House / apartment   
3       78750.0  22.590471  Secondary / secondary special  House / apartment   
4      144000.0  33.930882  Secondary / secondary special  House / apartment   

   Suitability  
0            0  
1            0  
2            0  
3            0  
4            0  
