# CMP 310 - Assessment

This notebook builds a machine learning model that predicts which island a penguin came from.
The target label is **Island**.
Because **Island** holds categorical data, the target is discrete and this is a classification problem.
The dataset is `modified_penguins.csv`, which was provided by the University.

The following block of code imports the libraries used for this project.

In [156]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

Now that the libraries have been imported, the data needs to be loaded.

In [157]:
penguin_data = pd.read_csv('modified_penguins.csv', delimiter=",")
penguin_data.head(25) # Used a larger number to capture a more diverse range of observations

Unnamed: 0,studyName,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
0,PAL0809,Anvers,Biscoe,"Adult, 1 Egg Stage",N5A1,Yes,09/11/2008,46.14,11.44,182.87,5708,FEMALE,8.01979,-26.68311,
1,PAL0910,Anvers,Biscoe,"Adult, 1 Egg Stage",N22A1,Yes,22/11/2009,47.31,16.69,187.81,5660,FEMALE,8.10231,-26.18763,
2,PAL0809,Anvers,Dream,"Adult, 1 Egg Stage",NP3,Yes,03/11/2007,52.42,17.17,181.62,5616,FEMALE,9.5776,-25.53059,
3,PAL0910,Anvers,Biscoe,"Adult, 1 Egg Stage",N39A1,Yes,22/11/2009,49.79,17.94,183.59,5691,FEMALE,8.41151,-26.13832,
4,PAL0910,Anvers,Dream,"Adult, 1 Egg Stage",N96A2,Yes,27/11/2009,39.16,11.02,188.94,4700,FEMALE,9.65061,-24.48153,
5,PAL0708,Anvers,Dream,"Adult, 1 Egg Stage",NP58,No,28/12/2007,46.44,17.71,206.44,4773,FEMALE,7.11315,-26.64442,Sexing primers did not amplify.
6,PAL0910,Anvers,Biscoe,"Adult, 1 Egg Stage",N58A1,Yes,12/11/2009,40.32,18.86,218.31,4973,FEMALE,8.43951,-26.57563,
7,PAL0910,Anvers,Biscoe,"Adult, 1 Egg Stage",N1A2,Yes,18/11/2009,43.65,19.22,224.13,4213,MALE,8.28601,-26.27573,
8,PAL0809,Anvers,Dream,"Adult, 1 Egg Stage",NP75,Yes,14/01/2008,36.35,21.13,213.65,5167,FEMALE,8.0395,-25.21259,
9,PAL0708,Anvers,Dream,"Adult, 1 Egg Stage",NP81,No,20/01/2008,39.3,10.13,178.26,5596,FEMALE,9.86965,-24.71691,


## Observations from `head()` 

After displaying 25 rows the following can be observed:

### Overview

- There are 15 columns containing both numerical and categorical features.
- The first few rows do not show missing values, but further inspection reveals some.

### Specific Observations

- Some features carry little value for the model:
    - **Region**, **Stage** and **Comments** contain either a single value or missing values, so they will not help when training a model.
    - **Individual ID** has too many distinct values, making it sparse and high in dimensionality.
    - **studyName** does not provide any value to the data set. <!--Must consider...-->
    - **Date Egg** does not provide any meaningful data for predicting the target label.
- **Sex** contains missing values and must be handled appropriately.

- Further exploration is required to detect duplicate rows or inconsistent data.
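
The column-level observations above can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming a small made-up frame (`toy`) in place of `modified_penguins.csv`; the helper name `summarise_columns` is hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "Region": ["Anvers", "Anvers", "Anvers", "Anvers"],
    "Individual ID": ["N5A1", "N22A1", "NP3", "N39A1"],
    "Sex": ["FEMALE", "MALE", np.nan, "FEMALE"],
})


def summarise_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return the number of distinct values and missing entries per column."""
    return pd.DataFrame({
        "n_unique": df.nunique(dropna=True),
        "n_missing": df.isnull().sum(),
    })


summary = summarise_columns(toy)
print(summary)

# Columns with a single distinct value carry no information for a classifier.
constant_cols = summary.index[summary["n_unique"] <= 1].tolist()
# Columns where every row has its own value behave like identifiers.
id_like_cols = summary.index[summary["n_unique"] == len(toy)].tolist()
print("Constant columns:", constant_cols)
print("Identifier-like columns:", id_like_cols)
```

Running the same helper on `penguin_data` before dropping any columns would give a quick way to confirm which features are constant, identifier-like or incomplete.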

The next step is to remove features that don't provide any value to the model.

In [158]:
penguin_data = penguin_data.drop(columns=['studyName', 'Region', 'Stage', 'Comments', 'Individual ID', 'Date Egg'])
penguin_data.head()

Unnamed: 0,Island,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo)
0,Biscoe,Yes,46.14,11.44,182.87,5708,FEMALE,8.01979,-26.68311
1,Biscoe,Yes,47.31,16.69,187.81,5660,FEMALE,8.10231,-26.18763
2,Dream,Yes,52.42,17.17,181.62,5616,FEMALE,9.5776,-25.53059
3,Biscoe,Yes,49.79,17.94,183.59,5691,FEMALE,8.41151,-26.13832
4,Dream,Yes,39.16,11.02,188.94,4700,FEMALE,9.65061,-24.48153


Next, further exploration is required to remove any possible duplicate rows and inconsistent data.

In [159]:
# Drop duplicates (drop_duplicates returns a new DataFrame, so the result is assigned back)
penguin_data = penguin_data.drop_duplicates()

# Checking for inconsistencies in categorical data
print('Unique values of \'Sex\': ', penguin_data['Sex'].unique())
print('Unique values of \'Clutch Completion\'', penguin_data['Clutch Completion'].unique())

print('Unique values of \'Island\': ', penguin_data['Island'].unique())

# Treat the '.' placeholder in 'Sex' as a missing value
penguin_data['Sex'] = penguin_data['Sex'].replace('.', np.nan)

# Check for missing values
print('Checking for missing values.')
print(penguin_data.isnull().sum())

# Handle missing data
penguin_data['Culmen Length (mm)'] = penguin_data['Culmen Length (mm)'].fillna(penguin_data['Culmen Length (mm)'].mean())
penguin_data['Culmen Depth (mm)'] = penguin_data['Culmen Depth (mm)'].fillna(penguin_data['Culmen Depth (mm)'].mean())
penguin_data['Flipper Length (mm)'] = penguin_data['Flipper Length (mm)'].fillna(penguin_data['Flipper Length (mm)'].mean())
penguin_data['Sex'] = penguin_data['Sex'].fillna(penguin_data['Sex'].mode()[0])
penguin_data['Delta 15 N (o/oo)'] = penguin_data['Delta 15 N (o/oo)'].fillna(penguin_data['Delta 15 N (o/oo)'].mean())
penguin_data['Delta 13 C (o/oo)'] = penguin_data['Delta 13 C (o/oo)'].fillna(penguin_data['Delta 13 C (o/oo)'].mean())

# Check for missing values
print('-----')
print('After filling missing values.')
print(penguin_data.isnull().sum())


Unique values of 'Sex':  ['FEMALE' 'MALE' nan '.']
Unique values of 'Clutch Completion' ['Yes' 'No']
Unique values of 'Island':  ['Biscoe' 'Dream' 'Torgersen']
Checking for missing values.
Island                  0
Clutch Completion       0
Culmen Length (mm)      4
Culmen Depth (mm)       2
Flipper Length (mm)     2
Body Mass (g)           0
Sex                    24
Delta 15 N (o/oo)      14
Delta 13 C (o/oo)      13
dtype: int64
-----
After filling missing values.
Island                 0
Clutch Completion      0
Culmen Length (mm)     0
Culmen Depth (mm)      0
Flipper Length (mm)    0
Body Mass (g)          0
Sex                    0
Delta 15 N (o/oo)      0
Delta 13 C (o/oo)      0
dtype: int64
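
The imputation step above repeats the same `fillna` call for each column. A more compact form is sketched below, assuming a made-up frame (`toy`) rather than the real dataset: numeric columns are filled with their mean and non-numeric columns with their mode, mirroring the strategy used in the cell above. This is only an illustration of the pattern, not a replacement for the original code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "Culmen Length (mm)": [40.0, np.nan, 50.0],
    "Body Mass (g)": [4000, 5000, 6000],
    "Sex": ["FEMALE", ".", "FEMALE"],
})

# Treat the '.' placeholder as a missing value, as in the notebook.
toy["Sex"] = toy["Sex"].replace(".", np.nan)

# Fill every numeric column with its own mean in one step.
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

# Fill every non-numeric column with its most frequent value.
categorical_cols = toy.select_dtypes(exclude="number").columns
for col in categorical_cols:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy.isnull().sum())
```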


Now that the data has been cleaned, it can be visualised to help understand the distribution of each variable, the relationships between variables, and any potential outliers or patterns.

Before plotting, the categorical features must be encoded as numbers so that numerical methods such as correlation can operate on them.

To do this, `LabelEncoder` from `sklearn.preprocessing` can be used to convert categorical attributes to numerical data.

Before doing this, it is best to separate the independent variables (the features) from the dependent variable (the target).

In [None]:
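# Work on copies so that encoding does not modify penguin_data or trigger SettingWithCopyWarning
X = penguin_data[['Clutch Completion', 'Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)', 'Sex', 'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)']].copy()
Y = penguin_data[['Island']].copy()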

In [167]:
sex_encoder = LabelEncoder()
cc_encoder = LabelEncoder()
X.loc[:, 'Sex'] = sex_encoder.fit_transform(penguin_data['Sex'])
X.loc[:, 'Clutch Completion'] = cc_encoder.fit_transform(penguin_data['Clutch Completion'])

# To see if it worked
print('Encoded Values of \'Sex\'')
print('-----')
print(X['Sex'].unique())
print('Encoded Values of \'Clutch Completion\'')
print('-----')
print(X['Clutch Completion'].unique())


Encoded Values of 'Sex'
-----
[0 1]
Encoded Values of 'Clutch Completion'
-----
[1 0]


In [None]:
island_encoder = LabelEncoder()
Y.loc[:, 'Island'] = island_encoder.fit_transform(Y['Island'])

# To check if it worked
print(Y['Island'].unique())

[0 1 2]
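
Once labels are encoded, it helps to know which integer stands for which island. The sketch below, assuming a short made-up array of island names, shows how a fitted `LabelEncoder` exposes this mapping through `classes_` and how `inverse_transform` recovers the original names, for example when reporting predictions.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only; the real notebook encodes Y['Island'].
islands = np.array(["Biscoe", "Dream", "Torgersen", "Dream"])

island_encoder = LabelEncoder()
codes = island_encoder.fit_transform(islands)

# classes_ holds the original labels in sorted order;
# the label at position i is the one encoded as the integer i.
mapping = {label: code for code, label in enumerate(island_encoder.classes_)}
print(mapping)

# Convert integer codes back to the original island names.
decoded = island_encoder.inverse_transform(codes)
print(decoded)
```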


## Visualisations

Before modelling, it is best to plot the data to identify patterns, relationships and potential issues such as outliers.

Using Seaborn (`sns`) and Matplotlib (`plt`), various visualisations can be created to explore and analyse the data, allowing for meaningful observations.
A correlation heatmap is most suitable here, as it shows at a glance which features are related to each other.
<!--Eloborate...-->

In [163]:
corr = X.corr()
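
The cell above computes the correlation matrix but does not draw it. One way the heatmap could be rendered with Seaborn is sketched below, assuming randomly generated toy features in place of the encoded `X`; the colour map, figure size and title are illustrative choices rather than anything prescribed by the notebook.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in for the encoded feature matrix X.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Flipper Length (mm)": rng.normal(200, 15, 50),
    "Culmen Depth (mm)": rng.normal(17, 2, 50),
})
# Make body mass depend on flipper length so the plot shows a visible correlation.
toy["Body Mass (g)"] = 25 * toy["Flipper Length (mm)"] + rng.normal(0, 200, 50)

# Pairwise Pearson correlation between all numeric columns.
corr = toy.corr()

# Draw the matrix as an annotated heatmap on a fixed -1..1 colour scale.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Feature correlation (toy data)")
fig.tight_layout()
```

In a notebook the figure is displayed when the cell finishes; in a script, a call to `plt.show()` would be needed. Fixing `vmin` and `vmax` to -1 and 1 keeps the colours comparable between plots.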
