# Data Transformation and Sampling

In this notebook, we convert the categorical columns of the training set into numerical data and scale its numerical columns. We will use the following techniques:

- MinMaxScaler for the numerical columns (`stem-height` and `stem-width`)
- LabelEncoder for the target column (`class`)
- Dummy (one-hot) variables for the categorical features

We will sample the data to work with a smaller dataset.

We will save the transformed data in the [Data/WorkingData](https://github.com/faendal/MushroomEdibilityPrediction/tree/main/Data/WorkingData) folder.

Finally, we will save the fitted transformation objects in the [Models/Transformations](https://github.com/faendal/MushroomEdibilityPrediction/tree/main/Models/Transformations) folder.

In [116]:
import pickle
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split

In [117]:
df = pd.read_csv('../Data/FullyClean/train.csv')

In [118]:
# Data type correction

types = {
    column: "category"
    for column in df.columns
    if df[column].dtype == "object"
}
df = df.astype(types)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1818187 entries, 0 to 1818186
Data columns (total 8 columns):
 #   Column           Dtype   
---  ------           -----   
 0   class            category
 1   cap-surface      category
 2   cap-color        category
 3   gill-attachment  category
 4   gill-color       category
 5   stem-height      float64 
 6   stem-width       float64 
 7   stem-color       category
dtypes: category(6), float64(2)
memory usage: 38.1 MB


In [119]:
# Normalization

minmax = MinMaxScaler()

df[["stem-height", "stem-width"]] = minmax.fit_transform(df[["stem-height", "stem-width"]])
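
As a minimal sketch of what the scaling step computes, the block below compares `MinMaxScaler` with the formula `(x - min) / (max - min)` applied per column. The measurements are made-up values for illustration, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stem measurements (height, width); not taken from the dataset.
values = np.array([[2.0, 10.0], [4.0, 30.0], [10.0, 20.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(values)

# Manual equivalent: (x - min) / (max - min), computed column by column.
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

print(np.allclose(scaled, manual))

# The fitted minimum and maximum are stored on the scaler and reused by transform().
print(scaler.data_min_, scaler.data_max_)
```

Because the minimum and maximum are learned during `fit`, the same fitted scaler should be reused on any later data rather than fitted again.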

In [120]:
# Label encoding
labelEncoder = LabelEncoder()

df["class"] = labelEncoder.fit_transform(df["class"])
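
To see which integer each class receives, the fitted encoder can be inspected. The sketch below uses hypothetical labels `"e"` and `"p"`; the labels in the real `class` column may differ.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical class labels; the actual values in the dataset are an assumption here.
labels = ["p", "e", "e", "p", "e"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; the position of a label is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Since `LabelEncoder` sorts the labels before assigning codes, the mapping depends only on the set of labels, not on their order in the data.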

In [121]:
categorical = df.select_dtypes(include="category").columns

for column in categorical:
    unique = df[column].nunique()
    print(f"{column}: {unique}")

cap-surface: 11
cap-color: 8
gill-attachment: 6
gill-color: 7
stem-color: 7


In [122]:
df = pd.get_dummies(df, columns=categorical, drop_first=False, dtype=int)
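
Unlike the scaler and the encoder, `pd.get_dummies` keeps no fitted state, so new data that lacks some categories produces fewer columns. One possible way to handle this, shown here as a sketch with hypothetical category values, is to reindex the new frame to the training columns.

```python
import pandas as pd

# Hypothetical category values for illustration only.
train = pd.DataFrame({"cap-color": ["n", "w", "y", "n"]})
new = pd.DataFrame({"cap-color": ["w", "w"]})

train_dummies = pd.get_dummies(train, columns=["cap-color"], dtype=int)
new_dummies = pd.get_dummies(new, columns=["cap-color"], dtype=int)

# The new frame only contains the "w" column, so align it with the training layout.
# Columns missing from the new data are created and filled with 0.
new_aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(new_aligned)
```

This is why it can be useful to keep the list of training columns alongside the saved transformation objects.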

## Data Sampling

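The transformed dataset has over 1.8 million rows and 42 columns (39 dummy columns, the two scaled numerical columns, and the encoded `class` column), which is more than we need to work with. We will draw a stratified sample to reduce its size and save it in the Data/WorkingData folder mentioned above.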

In [129]:
X = df.drop(columns=["class"])
y = df["class"]

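# Sampling: keep a stratified subset of 420,000 rows (about 23.1% of the data).
# No random_state is set, so each run draws a different sample.
X_sampled, _x, y_sampled, _y = train_test_split(X, y, test_size=0.7690006, stratify=y)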

# DataFrame creation from sampled data
df_sampled = X_sampled.copy()
df_sampled["class"] = y_sampled
print(df_sampled.shape)

(420000, 42)
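
The effect of `stratify` can be checked on a small synthetic example. The sketch below is an illustration under assumed data (1,000 rows, 30% positive class); it passes the sample size directly through `train_size` and fixes `random_state`, neither of which the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target: 300 positives and 700 negatives (assumed proportions).
y_demo = np.array([1] * 300 + [0] * 700)
X_demo = np.arange(1000).reshape(-1, 1)

# train_size accepts an absolute number of rows; stratify keeps the class ratio.
X_small, _, y_small, _ = train_test_split(
    X_demo, y_demo, train_size=200, stratify=y_demo, random_state=0
)

print(len(y_small))
print(y_demo.mean(), y_small.mean())
```

The class proportion in the sample should closely match the proportion in the full data, which is the property the notebook relies on when reducing the training set.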


In [130]:
df_sampled.to_csv("../Data/WorkingData/train.csv", index=False)

In [131]:
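filename = "../Models/Transformations/transformations.pkl"
variables = X.columns.to_numpy()  # feature names; computed here but not written to the pickle

# Save the fitted scaler and encoder; the context manager closes the file afterwards
with open(filename, "wb") as file:
    pickle.dump([minmax, labelEncoder], file)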
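
A later notebook can reload the saved objects and apply them to new data. The following sketch round-trips a scaler and an encoder through a temporary file. The fitted values, the labels, and the temporary path are assumptions for illustration; only the `[scaler, encoder]` list layout mirrors the cell above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Stand-ins for the fitted objects; the training values here are hypothetical.
demo_scaler = MinMaxScaler().fit(np.array([[0.0, 0.0], [10.0, 20.0]]))
demo_encoder = LabelEncoder().fit(["e", "p"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "transformations.pkl")

    # Save both objects as a list, in the same order as in the notebook.
    with open(path, "wb") as file:
        pickle.dump([demo_scaler, demo_encoder], file)

    # Load them back; the order of unpacking must match the order of saving.
    with open(path, "rb") as file:
        loaded_scaler, loaded_encoder = pickle.load(file)

# Apply the loaded objects with transform / inverse_transform, never fit again.
new_scaled = loaded_scaler.transform(np.array([[5.0, 5.0]]))
decoded_label = loaded_encoder.inverse_transform([1])

print(new_scaled, decoded_label)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that created them, and pickle files should only be loaded from trusted sources.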