## Text Classification 

### Predicting Category

#### Objectives

After completing this assignment, you will know how to write a simple supervised classification application.

#### Description

Write an AI application that, given a person's demographic, financial, and personal attributes (such as age, salary, and years employed), predicts that person's suitability for credit card issuance. For training and testing, use the labeled data set provided in the file k_creditcard.csv. The data set describes 9709 individuals, each labeled as either suitable (Target value of 1) or unsuitable (Target value of 0). Use 80% of the data items for training and the remaining 20% for testing. Train sklearn's KNeighborsClassifier with the parameter n_neighbors set to 5. After the classifier is trained, test it on the test data and produce the accuracy score, classification report, and confusion matrix. Then, optionally, try out a few self-created individual values and note the application's response.

The provided data set has 20 columns. However, I used only five of them, listed below.

Data columns used in this assignment:
Total_income, Age, Education_type, Housing_type, and Target


#### Implementation Notes


#### Dataset source

The data set was downloaded from the Kaggle website below.

https://www.kaggle.com/datasets/rohit265/credit-card-eligibility-data-determining-factors


### Submittal

The uploaded submittal should contain the following:

- The .ipynb file, saved after running the application from start to finish, containing the marked source code, output, and your interaction.

- The corresponding HTML file.

#### Coding

Follow the steps below.


Read the dataset from the csv file into a pandas dataframe

In [727]:
import pandas as pd
df=pd.read_csv('k_creditcard.csv')

Find the dimensions of the dataframe (rows, columns) and display its first few rows.

In [729]:
print (df.shape)
df[0:5]

(9709, 20)


|  | ID | Gender | Own_car | Own_property | Work_phone | Phone | Email | Unemployed | Num_children | Num_family | Account_length | Total_income | Age | Years_employed | Income_type | Education_type | Family_status | Housing_type | Occupation_type | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5008804 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 2 | 15 | 427500.0 | 32.868574 | 12.435574 | Working | Higher education | Civil marriage | Rented apartment | Other | 1 |
| 1 | 5008806 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 29 | 112500.0 | 58.793815 | 3.104787 | Working | Secondary / secondary special | Married | House / apartment | Security staff | 0 |
| 2 | 5008808 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 4 | 270000.0 | 52.321403 | 8.353354 | Commercial associate | Secondary / secondary special | Single / not married | House / apartment | Sales staff | 0 |
| 3 | 5008812 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 20 | 283500.0 | 61.504343 | 0.0 | Pensioner | Higher education | Separated | House / apartment | Other | 0 |
| 4 | 5008815 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 2 | 5 | 270000.0 | 46.193967 | 2.10545 | Working | Higher education | Married | House / apartment | Accountants | 0 |


Determine the number of missing or null entries

In [731]:
df.isna().sum().sum()


0

Display value count for each category for column Target.

Keep only the columns needed for this assignment.

In [734]:
df=df.filter(['Total_income','Age','Education_type','Housing_type','Target'])

Find the dimensions of the dataframe (rows, columns) and display its first few rows.

In [736]:
print (df.shape)
df[0:3]

(9709, 5)


|  | Total_income | Age | Education_type | Housing_type | Target |
|---|---|---|---|---|---|
| 0 | 427500.0 | 32.868574 | Higher education | Rented apartment | 1 |
| 1 | 112500.0 | 58.793815 | Secondary / secondary special | House / apartment | 0 |
| 2 | 270000.0 | 52.321403 | Secondary / secondary special | House / apartment | 0 |


Display the value count for each category in the Target column.

In [738]:
df.Target.value_counts()

Target
0    8426
1    1283
Name: count, dtype: int64
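The counts above show that the two classes are far from balanced. As a small illustration (plain arithmetic on the printed counts, not part of the original notebook), the share of each class can be worked out as follows:

```python
# Counts copied from the value_counts() output above.
counts = {0: 8426, 1: 1283}
total = sum(counts.values())

# Share of each class in the full data set.
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"Target={label}: {share:.1%}")
```

Roughly 87% of the individuals are labeled 0, which is worth keeping in mind when reading the accuracy score later on.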

Assign the Target column to variable y to be used as labels for the features

In [740]:
y=df.Target
print (type(y))

<class 'pandas.core.series.Series'>


Remove the Target column from among the features

In [742]:
df=df.drop(['Target'],axis=1)
df[0:3]

|  | Total_income | Age | Education_type | Housing_type |
|---|---|---|---|---|
| 0 | 427500.0 | 32.868574 | Higher education | Rented apartment |
| 1 | 112500.0 | 58.793815 | Secondary / secondary special | House / apartment |
| 2 | 270000.0 | 52.321403 | Secondary / secondary special | House / apartment |


Select only the categorical features so that they can be converted to numerical ones.

In [744]:
df_cat=df.filter(['Education_type','Housing_type'])
df_cat [0:3]

|  | Education_type | Housing_type |
|---|---|---|
| 0 | Higher education | Rented apartment |
| 1 | Secondary / secondary special | House / apartment |
| 2 | Secondary / secondary special | House / apartment |


Display the categories of Education_type

In [746]:
df_cat.Education_type.value_counts()
#df_cat.groupby('Education_type').count() #alternative form

Education_type
Secondary / secondary special    6761
Higher education                 2457
Incomplete higher                 371
Lower secondary                   114
Academic degree                     6
Name: count, dtype: int64

Display the categories of Housing_type

In [748]:
df_cat.Housing_type.value_counts()

Housing_type
House / apartment      8684
With parents            448
Municipal apartment     323
Rented apartment        144
Office apartment         76
Co-op apartment          34
Name: count, dtype: int64

By default, get_dummies creates k new numerical columns for each categorical column, where k is the number of categories in that column and each new column represents one category. However, if the parameter drop_first=True is provided, as is done below, it creates one column fewer. For example, for Education_type it creates 4 new columns instead of 5 (one per category). In that case, a value of 0 in all four new columns implies the presence of the fifth, dropped category.

By default, the values in each new column are True/False. However, if the parameter dtype=int is provided, as is done below, the column values become 0/1.

In [750]:
df_cat_num= pd.get_dummies(df_cat, dtype=int,drop_first=True)
df_cat_num [0:3]

|  | Education_type_Higher education | Education_type_Incomplete higher | Education_type_Lower secondary | Education_type_Secondary / secondary special | Housing_type_House / apartment | Housing_type_Municipal apartment | Housing_type_Office apartment | Housing_type_Rented apartment | Housing_type_With parents |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
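To make the effect of drop_first and dtype easier to see, here is a minimal sketch on a made-up three-row frame. The frame `toy` and the variable names are hypothetical; only the category names are borrowed from the data set.

```python
import pandas as pd

# Hypothetical three-row frame; the category names mirror the data set.
toy = pd.DataFrame({
    "Education_type": ["Higher education", "Lower secondary", "Academic degree"],
})

# Default call: one new column per category (3 here).
full = pd.get_dummies(toy)

# drop_first=True drops the first category in sorted order ("Academic degree");
# dtype=int stores the indicators as 1/0.
reduced = pd.get_dummies(toy, dtype=int, drop_first=True)

print(list(full.columns))
print(reduced)
```

In `reduced`, the last row ("Academic degree") has 0 in both remaining columns, which is how the dropped category is represented.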


Select the numerical columns.

In [752]:
df_num = df.filter(['Total_income','Age'],axis=1)
df_num [0:3]

|  | Total_income | Age |
|---|---|---|
| 0 | 427500.0 | 32.868574 |
| 1 | 112500.0 | 58.793815 |
| 2 | 270000.0 | 52.321403 |


Concatenate the numerical columns and the categorical columns (now converted to numerical columns). All columns are now numerical and together represent our features.

In [754]:
X = pd.concat([df_num,df_cat_num],axis=1)
print (X.shape)
X [0:3]

(9709, 11)


|  | Total_income | Age | Education_type_Higher education | Education_type_Incomplete higher | Education_type_Lower secondary | Education_type_Secondary / secondary special | Housing_type_House / apartment | Housing_type_Municipal apartment | Housing_type_Office apartment | Housing_type_Rented apartment | Housing_type_With parents |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 427500.0 | 32.868574 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 112500.0 | 58.793815 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 270000.0 | 52.321403 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |


Split all data (features and labels) into a training part and a testing part.

In [756]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
train_test_split(X, y, test_size=0.2, random_state=0)
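As a quick check of what test_size=0.2 means for 9709 rows, the sketch below runs the same call on stand-in data. The lists `rows` and `labels` are hypothetical placeholders for X and y; only the row count and the class counts are taken from the notebook.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 9709 rows: only the number of rows matters here.
rows = list(range(9709))
labels = [0] * 8426 + [1] * 1283

train_rows, test_rows, train_labels, test_labels = train_test_split(
    rows, labels, test_size=0.2, random_state=0
)

# Number of items in the training part and in the testing part.
print(len(train_rows), len(test_rows))
```

The testing part holds 1942 items, which matches the total support shown in the classification report at the end of the notebook.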

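Scale both the training and the testing data. Call fit_transform on the training data, which computes the scaling parameters and applies them, and then call transform on the testing data so that it is scaled with the same parameters, as shown below.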

In [758]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

X_train_scaled = sc.fit_transform (X_train)
X_test_scaled = sc.transform (X_test)

print (X_train_scaled[0:3])
print (X_test_scaled[0:3])

[[-1.23423518  1.74669168 -0.58731173 -0.19974545  9.28814637 -1.50219965
   0.34613536 -0.18794657 -0.09042992 -0.12420488 -0.22013887]
 [ 0.8865448  -0.11520336 -0.58731173 -0.19974545 -0.10766411  0.66569048
   0.34613536 -0.18794657 -0.09042992 -0.12420488 -0.22013887]
 [-0.24152966 -0.2251708  -0.58731173 -0.19974545 -0.10766411  0.66569048
   0.34613536 -0.18794657 -0.09042992 -0.12420488 -0.22013887]]
[[ 0.8865448  -1.24319554  1.70267331 -0.19974545 -0.10766411 -1.50219965
   0.34613536 -0.18794657 -0.09042992 -0.12420488 -0.22013887]
 [-0.01591477 -1.23540815 -0.58731173 -0.19974545 -0.10766411  0.66569048
  -2.88904316  5.32066105 -0.09042992 -0.12420488 -0.22013887]
 [ 0.43531501 -0.43118276  1.70267331 -0.19974545 -0.10766411 -1.50219965
   0.34613536 -0.18794657 -0.09042992 -0.12420488 -0.22013887]]
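The reason for calling fit_transform on the training data and only transform on the testing data can be illustrated with a small sketch. The income values below are made up; in the notebook the inputs would be X_train and X_test.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical income values standing in for X_train and X_test.
train = np.array([[100000.0], [200000.0], [300000.0]])
test = np.array([[400000.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # learns mean and std from train only
test_scaled = sc.transform(test)        # reuses the training mean and std

# Same result by hand: z = (x - training mean) / training std
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(test_scaled, manual)
```

Because the testing data is scaled with the statistics learned from the training data, nothing about the test set leaks into the preprocessing step.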


Train our selected model with the training data.

In [760]:
from sklearn.neighbors  import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors= 5)

#from sklearn.svm import SVC
#clf = SVC()

#from sklearn.ensemble import RandomForestClassifier
#clf = RandomForestClassifier(n_estimators=500)

#from sklearn.linear_model import LogisticRegression
#clf = LogisticRegression ()
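# Note: the next line replaces the scaled array with the unscaled X_train,
# so the classifier is fitted on unscaled features. Remove it to fit on
# the output of StandardScaler instead.
X_train_scaled=X_train
clf.fit (X_train_scaled,y_train)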
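With its default settings, KNeighborsClassifier ranks neighbors by Euclidean distance, so a feature with a large numeric range can dominate the result when the data is not scaled. The sketch below is only an illustration of that point, using the Total_income and Age values of the first three rows displayed earlier.

```python
import math

# Total_income and Age for the first three rows shown earlier.
a = (427500.0, 32.868574)
b = (112500.0, 58.793815)
c = (270000.0, 52.321403)

# Euclidean distance on the raw, unscaled values.
d_ab = math.dist(a, b)
d_ac = math.dist(a, c)

# Compare each distance with the income difference alone.
print(d_ab, abs(a[0] - b[0]))
print(d_ac, abs(a[0] - c[0]))
```

Each distance is practically equal to the income difference, so on unscaled data the Age values (and any 0/1 indicator columns) would have almost no influence on which neighbors are chosen.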

Test trained model with scaled testing data

In [762]:
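# Note: the next line replaces the scaled array with the unscaled X_test,
# matching the unscaled data the classifier was fitted on above.
X_test_scaled=X_test
y_pred = clf.predict(X_test_scaled)
y_pred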

array([0, 0, 0, ..., 0, 0, 0])
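The description also suggests trying a few self-created individuals. One possible way to do that is sketched below, assuming the new row must be encoded and scaled exactly like the training rows. The tiny training frame, the labels, and n_neighbors=3 are made up so that the example runs on its own; in the notebook, X, sc, and clf would come from the earlier cells.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the prepared training data (made-up rows).
train = pd.DataFrame({
    "Total_income": [50000.0, 60000.0, 70000.0, 300000.0, 320000.0, 350000.0],
    "Age": [25.0, 30.0, 28.0, 50.0, 55.0, 48.0],
    "Housing_type": ["With parents", "With parents", "Rented apartment",
                     "House / apartment", "House / apartment", "House / apartment"],
})
labels = [0, 0, 0, 1, 1, 1]

X = pd.get_dummies(train, dtype=int, drop_first=True)
sc = StandardScaler()
# n_neighbors=3 only because this toy set has six rows.
clf = KNeighborsClassifier(n_neighbors=3).fit(sc.fit_transform(X), labels)

# A self-created individual, encoded the same way as the training rows.
person = pd.DataFrame({"Total_income": [310000.0], "Age": [52.0],
                       "Housing_type": ["House / apartment"]})
person_X = pd.get_dummies(person, dtype=int)
# Align to the training columns; dummy columns absent from this row become 0.
person_X = person_X.reindex(columns=X.columns, fill_value=0)

print(clf.predict(sc.transform(person_X)))
```

The reindex step matters because a single row contains only one category per column, so get_dummies alone would not produce the same set of columns as the training data.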

Produce accuracy score, classification report, and confusion matrix

In [764]:
from sklearn.metrics import accuracy_score, \
classification_report,confusion_matrix

print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))
print (confusion_matrix (y_test,y_pred))

0.8583934088568486
              precision    recall  f1-score   support

           0       0.87      0.98      0.92      1691
           1       0.15      0.02      0.04       251

    accuracy                           0.86      1942
   macro avg       0.51      0.50      0.48      1942
weighted avg       0.78      0.86      0.81      1942

[[1662   29]
 [ 246    5]]
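The figures in the classification report can be recomputed from the confusion matrix, which helps when interpreting them. The sketch below is plain arithmetic on the numbers printed above; the variable names are my own.

```python
# Confusion matrix copied from the output above:
# rows are true labels (0, 1), columns are predicted labels (0, 1).
tn, fp = 1662, 29
fn, tp = 246, 5

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)   # of those predicted suitable, how many were
recall_1 = tp / (tp + fn)      # of those truly suitable, how many were found
majority_baseline = (tn + fp) / total  # score of always predicting 0

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2))
print(round(majority_baseline, 4))
```

The accuracy of about 0.86 is slightly below what always predicting class 0 would score on this test set, and only 5 of the 251 suitable individuals are found, so accuracy alone gives a flattering picture of this classifier.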
