
## Text Classification 

### Predicting Category

#### Objectives

On completing the assignment, you will be able to write a simple AI application involving supervised classification.

#### Description

Write an AI application that, given a penguin's attributes such as bill length, bill depth, and flipper length, predicts the species to which the penguin belongs. For training and testing the application, use the labeled data set provided in the file sn_penguin.csv. The file contains data for 344 penguins belonging to 3 species: Adelie, Chinstrap, and Gentoo. Each data item is labeled with the name of its species. Use 80% of the data items for training and the remaining 20% for testing. Use sklearn's KNeighborsClassifier with its parameter n_neighbors set to 5, and train the classifier with the training data. After the classifier is trained, test it using the testing data and produce the accuracy score, classification report, and confusion matrix. Then, optionally, try out a few self-created individual values and note the application's response.


#### Implementation Notes


#### Dataset source

The data set was downloaded from the Seaborn website.


### Submittal

The uploaded submittal should contain the following:

- The ipynb file, after running the application from start to finish, containing the marked source code, output, and user interaction.

- The html file corresponding to the above ipynb file.

#### Coding

Follow the steps of assignment 1.
Make sure to detect the data items containing null values and remove them.



## Keith Yrisarri Stateson
June 19, 2024. Python 3.11.0

## Title: Penguin Species Classification Using K-Nearest Neighbors (KNN)

## Summary
This project aims to build an AI application to classify penguin species based on various physical attributes such as bill length, bill depth, flipper length, body mass, and sex. The application uses a dataset containing measurements of 344 penguins from three species: Adelie, Chinstrap, and Gentoo. The project involves data cleaning, feature preprocessing, model training using the K-Nearest Neighbors (KNN) algorithm, and model evaluation using accuracy score, classification report, and confusion matrix.

Read data from the CSV file into a pandas DataFrame

Without index_col
A default integer index is created, and all columns in the CSV are treated as regular data columns.

With index_col=0
The first column of the CSV is used as the index of the DataFrame, providing meaningful row labels and potentially enhancing data manipulation and access.
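
As a minimal sketch of the difference, the snippet below reads a hypothetical three-row sample twice. It assumes, as the output further down suggests, that the first column of sn_penguin.csv is an unnamed row number; `csv_text`, `df_default`, and `df_indexed` are illustrative names, not part of the assignment.

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like sn_penguin.csv is assumed to be:
# the first, unnamed column holds the row number.
csv_text = """,species,island,bill_length_mm
0,Adelie,Torgersen,39.1
1,Adelie,Torgersen,39.5
2,Adelie,Torgersen,40.3
"""

# Without index_col, the unnamed first column becomes a data column named "Unnamed: 0".
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the same column becomes the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.columns.tolist())
print(df_indexed.columns.tolist())
```

The second frame has one column fewer, which is why the shape reported for the full file counts 7 columns rather than 8.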

In [6]:
import pandas as pd

df=pd.read_csv('sn_penguin.csv',index_col=0)
print (df.shape)
df [0:3]

(344, 7)


Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female


Detect data samples with entries containing null

df.isna(): Returns a DataFrame of the same shape as df, with True where a value is missing (NaN) and False otherwise. Chaining .sum().sum() reduces it to one total, which pandas returns as a NumPy 64-bit integer (np.int64).

na stands for 'not available'

NaN stands for "Not a Number." It is a special floating-point value defined by the IEEE 754 standard used to represent undefined or unrepresentable numerical results, such as the result of 0/0 or the square root of a negative number.

first .sum(): Sums down each column, resulting in a Series with the count of NaNs for each column.

second .sum(): Sums the Series obtained from the first sum, resulting in a single scalar value representing the total count of NaNs in the DataFrame.
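
The two-step sum can be sketched on a small hand-made frame. The frame `toy` and the extra row count `rows_with_nan` are illustrative additions, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Toy frame with one incomplete row (both values missing in row 1).
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3],
    "sex": ["Male", None, "Female"],
})

per_column = toy.isna().sum()   # first sum: NaN count per column (a Series)
total = per_column.sum()        # second sum: NaN count for the whole frame (a scalar)

# Number of rows that contain at least one NaN; these are the rows dropna() would remove.
rows_with_nan = toy.isna().any(axis=1).sum()

print(per_column.to_dict())
print(int(total), int(rows_with_nan))
```

The total counts missing cells, not rows, so it can be larger than the number of rows that dropna() removes.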


In [9]:
df.isna().sum().sum()

np.int64(0)

Drop data samples with entries containing null

df = df.dropna()
- Creates a Copy: This operation does not modify the original DataFrame df. Instead, it creates a new DataFrame with the rows containing NaN values removed.
- Reassignment Needed: To update the original DataFrame, you need to reassign the result back to df or another variable.
- Example: df_cleaned = df.dropna() keeps the original in df. In this case, the cleaned result simply overwrites the same variable with df = df.dropna().

df.dropna(inplace=True)
- Modifies In-Place: This operation modifies the original DataFrame df directly, removing the rows containing NaN values.
- No Reassignment Needed: Since the modification is in-place, there is no need to reassign the result to df.
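
A short sketch of the two styles on a toy frame (`toy`, `cleaned`, and `result` are illustrative names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Copy style: dropna() returns a new DataFrame and leaves toy untouched.
cleaned = toy.dropna()
print(toy.shape, cleaned.shape)

# In-place style: toy itself loses the incomplete row, and the call returns None.
result = toy.dropna(inplace=True)
print(toy.shape, result)
```

Because the in-place call returns None, writing `df = df.dropna(inplace=True)` would replace the DataFrame with None.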


In [10]:
df=df.dropna()
df.isna().sum().sum()

np.int64(0)

Display the shape and first few samples

In [11]:
print (df.shape)
df [0:3]

(333, 7)


Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female


In [12]:
y=df.species
print(type(y))

<class 'pandas.core.series.Series'>


In [14]:
df=df.drop('species',axis=1)
df[0:3]


Unnamed: 0,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Torgersen,40.3,18.0,195.0,3250.0,Female


In [17]:
# create a dataframe of the categorical variables
df_cat=df.filter(['island', 'sex'])
df_cat[0:3]

Unnamed: 0,island,sex
0,Torgersen,Male
1,Torgersen,Female
2,Torgersen,Female


In [21]:
df_cat.value_counts('island')


island
Biscoe       163
Dream        123
Torgersen     47
Name: count, dtype: int64

In [22]:
df_cat.value_counts('sex')

sex
Male      168
Female    165
Name: count, dtype: int64

In [37]:
df_cat_num = pd.get_dummies(df_cat, dtype=int, drop_first=True)
df_cat_num[0:3]

Unnamed: 0,island_Dream,island_Torgersen,sex_Male
0,0,1,1
1,0,1,0
2,0,1,0
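
The effect of `drop_first=True` in the cell above can be sketched on a toy frame (`toy`, `full`, and `reduced` are illustrative names). For each categorical column, the first category in sorted order is dropped and is represented by zeros in the remaining dummy columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Torgersen"],
    "sex": ["Female", "Male", "Female"],
})

# One dummy column per category.
full = pd.get_dummies(toy, dtype=int)

# One fewer per original column: "Biscoe" and "Female" become the all-zero baseline.
reduced = pd.get_dummies(toy, dtype=int, drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

This matches the three dummy columns shown above: a row with island_Dream = 0 and island_Torgersen = 0 is a Biscoe penguin, and sex_Male = 0 means Female.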


In [38]:
df_num=df.filter(['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'])
df_num[0:3]

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
0,39.1,18.7,181.0,3750.0
1,39.5,17.4,186.0,3800.0
2,40.3,18.0,195.0,3250.0


In [42]:
# combine the numerical and dummy variables into a single dataframe
X = pd.concat([df_num, df_cat_num], axis=1)
print(X.shape)
X[0:3]

(333, 7)


Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,island_Dream,island_Torgersen,sex_Male
0,39.1,18.7,181.0,3750.0,0,1,1
1,39.5,17.4,186.0,3800.0,0,1,0
2,40.3,18.0,195.0,3250.0,0,1,0


In [43]:
# split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [44]:
# initialize the scaler, fit and transform the training data, and transform the test data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

print(X_train_scaled[0:3])
print('\n')
print(X_test_scaled[0:3])

[[-0.08036036 -1.40321023  1.18032759  0.46895167 -0.75146915 -0.39562828
  -0.99250926]
 [ 1.1685254   1.48583187 -0.27322398 -1.10774729  1.33072662 -0.39562828
   1.00754728]
 [ 1.62102024  1.48583187  0.28050995  0.40830941  1.33072662 -0.39562828
   1.00754728]]


[[-0.40615665  0.45403112 -0.61930769 -0.31939781 -0.75146915 -0.39562828
   1.00754728]
 [-0.00796119 -1.66116042  0.48816018  0.10509807 -0.75146915 -0.39562828
  -0.99250926]
 [ 1.05992664  0.76357135 -0.41165746 -0.74389368  1.33072662 -0.39562828
   1.00754728]]
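
The reason for calling `fit_transform` on the training data but only `transform` on the test data can be sketched with a single made-up feature: the test rows are scaled with the mean and standard deviation learned from the training rows. The arrays below are illustrative values, not penguin data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [4.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mean and std from the training rows
X_test_scaled = sc.transform(X_test)        # reuses them; the scaler is not refitted

# The same computation by hand: (x - training mean) / training std
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_test_scaled, manual))
```

Fitting the scaler on the test data as well would let information from the test set leak into preprocessing.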


In [45]:
# train the selected model with scaled training data and make predictions on the scaled test data
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5)

# Support Vector Classifier (SVC) model
#from sklearn.svm import SVC
#clf = SVC()

# Random Forest Classifier model
#from sklearn.ensemble import RandomForestClassifier
#clf = RandomForestClassifier(n_estimators=500)

# Logistic Regression model
#from sklearn.linear_model import LogisticRegression
#clf = LogisticRegression ()

clf.fit (X_train_scaled, y_train)    

In [46]:
# Test the trained model with the scaled test data / make predictions based on the test data
y_pred = clf.predict(X_test_scaled)

In [49]:
# Evaluate the model
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score

print('Accuracy: ', accuracy_score(y_test, y_pred))
print('\n')
print('Classification Report: \n', classification_report(y_test, y_pred))
print('\n')
print('Confusion Matrix: \n', confusion_matrix(y_test, y_pred))
print('\n')
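# roc_auc_score(y_test, y_pred) raises an error here: for a multiclass target it expects class probabilities, not predicted labels.
# A call of this form should work: roc_auc_score(y_test, clf.predict_proba(X_test_scaled), multi_class='ovr')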

Accuracy:  1.0


Classification Report: 
               precision    recall  f1-score   support

      Adelie       1.00      1.00      1.00        28
   Chinstrap       1.00      1.00      1.00        17
      Gentoo       1.00      1.00      1.00        22

    accuracy                           1.00        67
   macro avg       1.00      1.00      1.00        67
weighted avg       1.00      1.00      1.00        67



Confusion Matrix: 
 [[28  0  0]
 [ 0 17  0]
 [ 0  0 22]]
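
On the ROC AUC line in the cell above: for a target with more than two classes, `roc_auc_score` works from per-class probabilities and needs the `multi_class` argument. The sketch below uses a synthetic three-class data set from `make_classification` as a stand-in for the penguin data, so the printed score says nothing about the penguin model.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class stand-in for the penguin data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Multiclass ROC AUC is computed from class probabilities, not from predicted labels.
y_proba = clf.predict_proba(X_test)   # shape: (n_samples, n_classes)
auc = roc_auc_score(y_test, y_proba, multi_class='ovr')
print(auc)
```

With `multi_class='ovr'`, each class is scored against the rest and the results are averaged.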




In [1]:
# Predict Penguin Species for the provided input

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

def predict_penguin_species(island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex):
    # Load the dataset
    df = pd.read_csv('sn_penguin.csv')
    df=df.dropna()

    y=df['species']
    df = df.drop('species', axis=1)

    df_cat = df.filter(['island','sex'])
    df_cat_num = pd.get_dummies(df_cat, dtype=int, drop_first=True)    
    df_num = df.filter(['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'])
    X = pd.concat([df_num, df_cat_num], axis=1)

    # split the df into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    # initialize the scaler, fit and transform the training data, and transform the test data
    sc= StandardScaler()
    X_train_scaled = sc.fit_transform(X_train)
    X_test_scaled = sc.transform(X_test)
    
    # train the KNeighborsClassifier model
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit (X_train_scaled, y_train)
    
    # skip the evaluation of the model as the accuracy is high
    
    # Build a one-row DataFrame from the provided input
    input_data = pd.DataFrame([[island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex]], 
                              columns=['island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex'])
    input_data_cat = input_data.filter(['island','sex'])
    # No drop_first here: a single row has only one category per column,
    # so dropping it would discard the island and sex that were passed in
    input_data_cat_num = pd.get_dummies(input_data_cat, dtype=int)
    input_data_num = input_data.filter(['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'], axis=1)
    input_data_final = pd.concat([input_data_num, input_data_cat_num], axis=1)
    
    # Align with the training columns: missing dummy columns become 0,
    # baseline-category columns are dropped, and the column order matches X
    input_data_final = input_data_final.reindex(columns=X.columns, fill_value=0)
    
    # Scale input data using the same scaler fitted on training data
    input_data_scaled = sc.transform(input_data_final)

    # Make a prediction for the provided input
    prediction = clf.predict(input_data_scaled)
    
    return prediction[0]


In [2]:
# Examples of predicting a penguin's species given the input
print(predict_penguin_species('Torgersen', 100, 18, 178, 3880, 'Male'))
print(predict_penguin_species('Torgersen', 34, 18, 189, 3880, 'Female'))
print(predict_penguin_species('Dream', 300, 16, 168, 3880, 'Male'))
print(predict_penguin_species('Dream', 39, 15, 164, 3880, 'Female'))
print(predict_penguin_species('Biscoe', 1, 24, 500, 2000, 'Male'))
print(predict_penguin_species('Biscoe', 38, 26, 172, 3880, 'Female'))

Chinstrap
Adelie
Chinstrap
Adelie
Gentoo
Adelie
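
One detail of single-row prediction is worth checking by hand: the one-hot encoding of the input. With `drop_first=True`, a one-row frame loses its only category, so the island and sex would not reach the classifier. The sketch below shows one way to encode a single row and align it with the training columns using `reindex`. The list `train_columns` is copied from the X shown earlier, and the input values are those of the first penguin in the data set.

```python
import pandas as pd

# Column order the classifier was trained on (taken from the X shown earlier).
train_columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g',
                 'island_Dream', 'island_Torgersen', 'sex_Male']

row = pd.DataFrame([{'island': 'Torgersen', 'bill_length_mm': 39.1, 'bill_depth_mm': 18.7,
                     'flipper_length_mm': 181.0, 'body_mass_g': 3750.0, 'sex': 'Male'}])

# drop_first=True on one row removes the only category present in each column.
dropped = pd.get_dummies(row[['island', 'sex']], dtype=int, drop_first=True)
print(dropped.shape)

# Encoding every category and then reindexing keeps the information:
# missing training columns are filled with 0 and extra columns are discarded.
encoded = pd.get_dummies(row, columns=['island', 'sex'], dtype=int)
aligned = encoded.reindex(columns=train_columns, fill_value=0)
print(aligned.iloc[0].to_dict())
```

A Biscoe female would produce island_Biscoe and sex_Female columns, which `reindex` discards, leaving zeros in all three dummy columns as the baseline encoding requires.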
