<a href="https://colab.research.google.com/github/Zorcaris/NeuralNetworksAndDeepLearning/blob/main/Tabular_data_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tabular data classification Kaggle
*Rice type classification*

**1: Set up**

In [1]:
!pip install opendatasets --quiet
import opendatasets as od
od.download("https://www.kaggle.com/datasets/mssmartypants/rice-type-classification")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: Zorcaris
Your Kaggle Key: ··········
Dataset URL: https://www.kaggle.com/datasets/mssmartypants/rice-type-classification


In [2]:
import torch # Torch main framework
import torch.nn as nn # Used for getting the NN Layers
from torch.optim import Adam # Adam Optimizer
from torch.utils.data import Dataset, DataLoader # Dataset class and DataLoader for creating the data objects
from torchsummary import summary # Visualize the model layers and number of parameters
from sklearn.model_selection import train_test_split # Split the dataset (train, validation, test)
from sklearn.metrics import accuracy_score # Calculate the testing Accuracy
import matplotlib.pyplot as plt # Plotting the training progress at the end
import pandas as pd # Data reading and preprocessing
import numpy as np # Mathematical operations

device = 'cuda' if torch.cuda.is_available() else 'cpu' # Use the GPU if one is available, otherwise the CPU; on a Mac, use 'mps' and check torch.backends.mps.is_available() instead

**2: Gather data**

In [3]:
data_df = pd.read_csv("/content/rice-type-classification/riceClassification.csv")
# 1 Drop rows with missing values and the unneeded "id" column
data_df.dropna(inplace = True) # Remove rows that contain missing (NaN/null) values
data_df.drop(["id"], axis =1, inplace = True) # "id" is only a row identifier, not a feature

# 2 Check the shape of the dataset
print("Data Shape (rows, cols): ", data_df.shape) # 10 feature columns plus the "Class" label

# 3 Possible output values
print("Output possibilities: ", data_df["Class"].unique())

data_df.head() # Show the first five rows

Data Shape (rows, cols):  (18185, 11)
Output possibilities:  [1 0]


| | Area | MajorAxisLength | MinorAxisLength | Eccentricity | ConvexArea | EquivDiameter | Extent | Perimeter | Roundness | AspectRation | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4537 | 92.229316 | 64.012769 | 0.719916 | 4677 | 76.004525 | 0.657536 | 273.085 | 0.76451 | 1.440796 | 1 |
| 1 | 2872 | 74.691881 | 51.400454 | 0.725553 | 3015 | 60.471018 | 0.713009 | 208.317 | 0.831658 | 1.453137 | 1 |
| 2 | 3048 | 76.293164 | 52.043491 | 0.731211 | 3132 | 62.296341 | 0.759153 | 210.012 | 0.868434 | 1.46595 | 1 |
| 3 | 3073 | 77.033628 | 51.928487 | 0.738639 | 3157 | 62.5513 | 0.783529 | 210.657 | 0.870203 | 1.483456 | 1 |
| 4 | 3693 | 85.124785 | 56.374021 | 0.749282 | 3802 | 68.571668 | 0.769375 | 230.332 | 0.874743 | 1.51 | 1 |


**3: Data normalization**

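Normalization brings every feature onto a comparable scale, so that columns with large magnitudes (such as `Area`) do not dominate columns with small magnitudes (such as `Eccentricity`) during training.

***Scaling by the maximum:***

$$ \frac{\text{All values in column}}{\text{Largest value}} $$
$$ \\ $$

*   Each value in the column is divided by the largest value in that column.
*   This scales all values to between **0 and 1**.
*   It only gives that range when all values in the column are **positive**.

$$ X_{\text{normalized}} = \frac{X}{X_{\max}} $$

$$ \\ $$


*   ***Min-max scaling*** also handles **negative** numbers and always maps the column onto the range 0 to 1.

$$ X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} $$

$$ \\ $$

*   **Z-score** normalization is another option: it subtracts the column mean $\mu$ and divides by the column standard deviation $\sigma$.

$$ X' = \frac{X - \mu}{\sigma} $$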
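As a quick check of the three formulas above, the following sketch applies each of them to a small made-up column with NumPy. The sample values and the variable names are illustrative assumptions and are not taken from the rice dataset.

```python
import numpy as np

# Hypothetical column of feature values (not from the rice dataset)
x = np.array([2.0, 4.0, 6.0, 8.0])

# Division by the largest absolute value: X / max(|X|)
max_abs_scaled = x / np.abs(x).max()

# Min-max scaling: (X - X_min) / (X_max - X_min)
min_max_scaled = (x - x.min()) / (x.max() - x.min())

# Z-score: (X - mu) / sigma, using the population standard deviation
z_scored = (x - x.mean()) / x.std()

print(max_abs_scaled)  # largest value becomes 1
print(min_max_scaled)  # smallest value becomes 0, largest becomes 1
print(z_scored)        # mean 0, standard deviation 1
```

Note that only min-max scaling pins the smallest value to 0; dividing by the maximum leaves it at whatever fraction of the maximum it happens to be.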


In [4]:
original_df = data_df.copy() # Keep an unnormalized copy so inference inputs can later be scaled with the same column maxima

for column in data_df.columns:
    data_df[column] = data_df[column]/data_df[column].abs().max() # Divide each column by its largest absolute value
data_df.head() # Check the changes

| | Area | MajorAxisLength | MinorAxisLength | Eccentricity | ConvexArea | EquivDiameter | Extent | Perimeter | Roundness | AspectRation | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.444368 | 0.503404 | 0.775435 | 0.744658 | 0.424873 | 0.66661 | 0.741661 | 0.537029 | 0.844997 | 0.368316 | 1.0 |
| 1 | 0.281293 | 0.407681 | 0.622653 | 0.750489 | 0.273892 | 0.53037 | 0.80423 | 0.409661 | 0.919215 | 0.371471 | 1.0 |
| 2 | 0.298531 | 0.416421 | 0.630442 | 0.756341 | 0.28452 | 0.54638 | 0.856278 | 0.412994 | 0.959862 | 0.374747 | 1.0 |
| 3 | 0.300979 | 0.420463 | 0.629049 | 0.764024 | 0.286791 | 0.548616 | 0.883772 | 0.414262 | 0.961818 | 0.379222 | 1.0 |
| 4 | 0.361704 | 0.464626 | 0.682901 | 0.775033 | 0.345385 | 0.601418 | 0.867808 | 0.452954 | 0.966836 | 0.386007 | 1.0 |
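The copy kept in `original_df` matters at inference time: a new grain has to be scaled with the same per-column maxima that were applied to the training data. Here is a minimal sketch, assuming a hypothetical two-column stand-in for `original_df` and a hypothetical `normalize_sample` helper (neither is part of the notebook):

```python
import pandas as pd

# Hypothetical stand-in for original_df (unnormalized feature columns)
original_df = pd.DataFrame({"Area": [4537, 2872, 3048],
                            "Perimeter": [273.085, 208.317, 210.012]})

# Per-column maxima, computed once from the unnormalized data
column_max = original_df.abs().max()

def normalize_sample(sample: dict) -> pd.Series:
    """Scale one raw sample with the maxima of the original data."""
    # Division aligns on the column names shared by both Series
    return pd.Series(sample) / column_max

new_grain = {"Area": 3000, "Perimeter": 205.0}  # made-up measurements
scaled = normalize_sample(new_grain)
print(scaled)
```

Reusing the stored maxima, rather than recomputing them from the new sample, keeps inference inputs on the same scale the model saw during training.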


**4: Splitting the data**

In [10]:
# 1 Indexing
# .iloc is the pandas DataFrame indexer for selecting rows and columns by integer position
X = np.array(data_df.iloc[:,:-1]) # All rows, every column except the last (the features)
Y = np.array(data_df.iloc[:,-1])  # All rows, only the last column (the "Class" label)

# To visualise what .iloc does
#print (X)

# 2 Splitting
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3) # 70% training, 30% held out
X_test, X_validation , y_test, y_validation = train_test_split(X_test, y_test, test_size = 0.5) # Split the held-out 30% in half: 15% testing, 15% validation

print("Training set is: ", X_train.shape[0], " rows which is ", round(X_train.shape[0]/data_df.shape[0],4)*100, "%")
print("Validation set is: ",X_validation.shape[0], " rows which is ", round(X_validation.shape[0]/data_df.shape[0],4)*100, "%")
print("Testing set is: ",X_test.shape[0], " rows which is ", round(X_test.shape[0]/data_df.shape[0],4)*100, "%")

Training set is:  12729  rows which is  70.0 %
Validation set is:  2728  rows which is  15.0 %
Testing set is:  2728  rows which is  15.0 %


`train_test_split()` is a scikit-learn function that splits a dataset into random subsets, so that model performance can be evaluated by training on one subset and testing on another.
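To make the idea concrete, here is a rough sketch of what a shuffled 70/15/15 split does, written with plain NumPy. It is not scikit-learn's implementation; the helper name `split_indices`, the seed and the toy array sizes are assumptions made for illustration.

```python
import numpy as np

def split_indices(n_rows, train_frac=0.7, seed=0):
    """Shuffle row indices and cut them into train / validation / test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_train = round(n_rows * train_frac)
    n_val = (n_rows - n_train) // 2  # half of the remainder goes to validation
    train_idx = shuffled[:n_train]
    val_idx = shuffled[n_train:n_train + n_val]
    test_idx = shuffled[n_train + n_val:]
    return train_idx, val_idx, test_idx

# Toy data: 100 rows, 10 features, binary labels
X = np.arange(1000, dtype=float).reshape(100, 10)
Y = np.arange(100) % 2

train_idx, val_idx, test_idx = split_indices(len(X))
X_train, y_train = X[train_idx], Y[train_idx]
X_validation, y_validation = X[val_idx], Y[val_idx]
X_test, y_test = X[test_idx], Y[test_idx]

print(len(X_train), len(X_validation), len(X_test))  # 70 15 15
```

Because the three index sets come from a single permutation, no row can end up in more than one subset.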

**5: Converting our data to a PyTorch Dataset**

In [14]:
class dataset(Dataset): # Subclass of the PyTorch Dataset imported above
    def __init__(self, X, Y):
        # Convert the NumPy arrays to PyTorch tensors
        self.X = torch.tensor(X, dtype = torch.float32).to(device) # .to(device) moves the data to the selected device (GPU if available) for faster computation
        self.Y = torch.tensor(Y, dtype = torch.float32).to(device)
    def __len__(self):
        return len(self.X) # Number of samples
    def __getitem__(self, index):
        return self.X[index], self.Y[index] # One (features, label) pair

training_data = dataset(X_train, y_train)
validation_data = dataset(X_validation, y_validation)
testing_data = dataset(X_test, y_test)

#print (training_data[0]) # To visualise dataset
# https://youtu.be/E0bwEAWmVEM?si=gqxo-3JufCJ3-3yI&t=1857 Video reference: end of the splitting and conversion explanation

(tensor([0.6230, 0.7504, 0.7223, 0.9320, 0.5888, 0.7893, 0.7007, 0.6307, 0.8590,
        0.5895]), tensor(0.))
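Once the data is wrapped in a `Dataset`, a `DataLoader` can serve it in shuffled mini-batches. The sketch below follows the same pattern as the class above but uses random stand-in arrays and keeps the tensors on the CPU; the class name `ToyDataset` and the batch size of 32 are assumptions chosen for illustration.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Minimal Dataset holding features and labels as CPU tensors."""
    def __init__(self, X, Y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.Y = torch.tensor(Y, dtype=torch.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, index):
        return self.X[index], self.Y[index]

# Random stand-in data: 100 rows, 10 features, binary labels
rng = np.random.default_rng(0)
X = rng.random((100, 10))
Y = rng.integers(0, 2, size=100)

# Assumed batch size of 32; shuffle=True reorders the rows every epoch
loader = DataLoader(ToyDataset(X, Y), batch_size=32, shuffle=True)

features, labels = next(iter(loader))
print(features.shape, labels.shape)
```

The `DataLoader` relies only on `__len__` and `__getitem__`, which is why the class has to define both.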
