# Alexine Studios

![Penguin](https://github.com/chiruharshith/Alexine_Studios/blob/main/media/penguine.png?raw=true)

The dataset consists of the following 7 columns:

- **species:** penguin species (Chinstrap, Adélie, or Gentoo)
- **culmen_length_mm:** culmen length in mm (the culmen is the upper ridge of a bird's beak)
- **culmen_depth_mm:** culmen depth in mm
- **flipper_length_mm:** flipper length in mm
- **body_mass_g:** body mass in g
- **island:** island name (Dream, Torgersen, or Biscoe)
- **sex:** penguin sex

## Download Dataset

In [None]:
!mkdir -p datasets
# ?raw=true fetches the CSV itself rather than the GitHub HTML page
!wget -qq "https://github.com/chiruharshith/Alexine_Studios/blob/main/datasets/penguins.csv?raw=true" -O datasets/penguins.csv

### Importing Required Packages

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [32]:
# Load the data
df = pd.read_csv('datasets/penguins.csv')
df.head(3)

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE


In [33]:
# Count NaN values in each column of the dataframe
df.isna().sum()

species               0
island                0
culmen_length_mm      2
culmen_depth_mm       2
flipper_length_mm     2
body_mass_g           2
sex                  10
dtype: int64

In [34]:
# Show the unique values in the sex column
df['sex'].unique()

array(['MALE', 'FEMALE', nan, '.'], dtype=object)

In [35]:
# Drop the rows where the sex column is NaN; in this dataset that also
# removes the two rows with missing measurements
df.dropna(subset=['sex'], inplace=True)

# Print the unique values in the sex column after dropping
print("Unique values after dropping NA values : ", df.sex.unique())

Unique values after dropping NA values :  ['MALE' 'FEMALE' '.']


In [36]:
# Inspect the rows where sex is recorded as '.'
df[df.sex == '.']

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
336,Gentoo,Biscoe,44.5,15.7,217.0,4875.0,.


In [37]:
# Preview the unique values that would remain after dropping row 336
df.drop(336).sex.unique()

array(['MALE', 'FEMALE'], dtype=object)

In [38]:
# Drop row 336 (the row with sex == '.') for real
df.drop(336, inplace=True)
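
The two cleaning steps above (dropping NaN values and dropping the row with `'.'`) could also be done in one pass without hard-coding the index 336. This is a minimal sketch on a toy frame with made-up rows; `toy`, `valid`, and `toy_clean` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up rows that mimics the problem values in the `sex` column.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "sex": ["MALE", np.nan, ".", "FEMALE"],
})

# Keep only rows whose `sex` is one of the two valid labels.
# isin() returns False for NaN, so missing values are filtered out as well.
valid = toy["sex"].isin(["MALE", "FEMALE"])
toy_clean = toy[valid].copy()

print(toy_clean["sex"].unique())
```

Filtering by value rather than by row label keeps working if the CSV is re-sorted or gains more malformed rows.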

In [39]:
df.head(2)

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE


In [40]:
# Encode the categorical columns as integers
df['species'] = df['species'].replace(['Adelie', 'Chinstrap', 'Gentoo'], [0, 1, 2])
df['island'] = df['island'].replace(['Torgersen', 'Biscoe', 'Dream'], [2, 1, 0])
df['sex'] = df['sex'].replace(['MALE', 'FEMALE'], [1, 0])

In [41]:
df.head(2)

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,0,2,39.1,18.7,181.0,3750.0,1
1,0,2,39.5,17.4,186.0,3800.0,0
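
The same integer codes can be written as explicit dictionaries and applied with `Series.map`, which keeps each label next to its code. This is a sketch on a toy frame with made-up rows; the dictionary names are illustrative assumptions, while the codes mirror the `replace()` calls above.

```python
import pandas as pd

# Toy frame with made-up rows; column names follow the dataset above.
toy = pd.DataFrame({
    "species": ["Adelie", "Chinstrap", "Gentoo"],
    "island": ["Torgersen", "Dream", "Biscoe"],
    "sex": ["MALE", "FEMALE", "FEMALE"],
})

# Same integer codes as the replace() calls, written as dictionaries.
species_codes = {"Adelie": 0, "Chinstrap": 1, "Gentoo": 2}
island_codes = {"Dream": 0, "Biscoe": 1, "Torgersen": 2}
sex_codes = {"FEMALE": 0, "MALE": 1}

# assign() returns a new frame and leaves `toy` unchanged.
encoded = toy.assign(
    species=toy["species"].map(species_codes),
    island=toy["island"].map(island_codes),
    sex=toy["sex"].map(sex_codes),
)
print(encoded)
```

Unlike `replace()`, `map()` turns any label missing from the dictionary into NaN, so unexpected values surface as missing data instead of passing through unchanged.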


In [42]:
# Store the features and the target in two separate variables, x and y
x = df.drop(['species'], axis=1)
y = df['species']

In [43]:
x.shape, y.shape

((333, 6), (333,))

In [44]:
# Split the data into train and test sets in an 80:20 ratio,
# i.e. 80% of the data is the train set and 20% is the test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)

In [45]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((266, 6), (67, 6), (266,), (67,))

Learn more about [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
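
The split above is random, so the reported accuracies can change from run to run. As a sketch of one way to make it repeatable, `train_test_split` accepts `random_state` (fixes the shuffle) and `stratify` (keeps class proportions similar in both sets). The data below is made up, and the seed value 42 is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 50 samples, 3 features, 3 imbalanced classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 3))
labels = np.array([0] * 25 + [1] * 15 + [2] * 10)

# random_state fixes the shuffle; stratify keeps the class mix in both sets.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print(f_train.shape, f_test.shape)
print(np.bincount(l_test))
```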

### Training a Linear Classifier

In [46]:
from sklearn.linear_model import SGDClassifier
linear_classifier = SGDClassifier()

# Training or fitting the model with the train data
linear_classifier.fit(x_train, y_train)

# Predict on the test set with the trained model
y_pred = linear_classifier.predict(x_test)

In [47]:
from sklearn.metrics import accuracy_score
# accuracy_score expects the true labels first, then the predictions
print(accuracy_score(y_test, y_pred))

0.208955223880597


### Scaling the data

`SGDClassifier` is sensitive to feature scale. Here `body_mass_g` is in the thousands while `culmen_depth_mm` is in the tens, so the unscaled gradient updates are dominated by a single feature. Standardizing every feature to zero mean and unit variance removes that imbalance.

In [49]:
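# Standardize the features: fit the scaler on the train set only,
# then apply the same transformation to the test set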
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X_train_scale1 = scaler.fit_transform(x_train)
X_test_scale1 = scaler.transform(x_test)

linear_classifier = SGDClassifier()

# Training or fitting the model with the train data
linear_classifier.fit(X_train_scale1, y_train)

# Predict on the scaled test set with the trained model
y_pred_scale = linear_classifier.predict(X_test_scale1)

# True labels first, then the predictions
print(accuracy_score(y_test, y_pred_scale))

1.0
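
The scale-then-fit steps above can also be bundled into a single estimator with `make_pipeline`, so the scaler is always fitted on the training data only. This is a minimal sketch on made-up data whose two features loosely imitate a length in mm and a mass in g; the numbers, seeds, and variable names are assumptions for illustration, not the penguin measurements.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales, three classes.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 40)
length = rng.normal(loc=40 + 5 * labels, scale=1.0)
mass = rng.normal(loc=3500 + 800 * labels, scale=150.0)
features = np.column_stack([length, mass])

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)

# fit() scales the training data and trains the classifier;
# score() applies the already-fitted scaler to the test data before predicting.
model = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
model.fit(f_train, l_train)
print(model.score(f_test, l_test))
```

With a pipeline there is no separate `fit_transform`/`transform` pair to keep in sync, which makes it harder to scale the test set with the wrong statistics by accident.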
