In [4]:
import pandas as pd
import numpy as np

Loading the data. Columns: humidity level, temperature, time of day, whether the mantis successfully lured its prey, time spent hunting, prey size, whether the mantis hunted from ambush, food availability, whether other predators were present, whether the mantis competed for prey, and whether the hunt was successful.
Decision: the task is classification of Hunting Success. Given the parameters above, the model should predict whether the hunt was successful or not.

In [5]:
dataset = pd.read_csv('../data/data.csv')

dataset.head(5)

Unnamed: 0,Humidity Level,Temperature,Time of Day,Luring Success,Time Spent Hunting,Prey Size,Ambush,Food Availability,Predator Presence,Is Competition,Hunting Success
0,0.32,31.25,Morning,True,0.81,6.6,False,,True,False,0
1,0.66,29.08,Afternoon,False,2.81,5.38,False,Low,False,False,1
2,0.61,25.74,Evening,True,3.79,15.189246,True,Low,False,False,1
3,0.37,22.68,Morning,False,4.27,5.06,True,Low,True,False,1
4,0.65,29.2,Afternoon,False,3.84,2.71,False,High,False,False,1


Basic information about the dataset

In [6]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Humidity Level      300000 non-null  float64
 1   Temperature         300000 non-null  float64
 2   Time of Day         300000 non-null  object 
 3   Luring Success      300000 non-null  bool   
 4   Time Spent Hunting  300000 non-null  float64
 5   Prey Size           300000 non-null  float64
 6   Ambush              300000 non-null  bool   
 7   Food Availability   269788 non-null  object 
 8   Predator Presence   270292 non-null  object 
 9   Is Competition      269881 non-null  object 
 10  Hunting Success     300000 non-null  int64  
dtypes: bool(2), float64(4), int64(1), object(4)
memory usage: 21.2+ MB


Summary statistics

In [7]:
dataset.describe()

Unnamed: 0,Humidity Level,Temperature,Time Spent Hunting,Prey Size,Hunting Success
count,300000.0,300000.0,300000.0,300000.0,300000.0
mean,0.599884,27.498432,2.729269,5.662011,0.549857
std,0.173421,4.330176,1.406132,2.815613,0.497509
min,0.3,20.0,0.4,1.0,0.0
25%,0.45,23.75,1.55,3.31,0.0
50%,0.6,27.49,2.693402,5.59,1.0
75%,0.75,31.25,3.84,7.88,1.0
max,0.9,35.0,10.189327,20.974719,1.0


<b style="background-color: #470047">Checking how many duplicates exist in the dataset</b>

In [20]:
broj_duplikata = dataset.duplicated().sum()
print(f'Number of duplicates: {broj_duplikata}')

Number of duplicates: 0


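Observations: columns 8 and 9 (Predator Presence, Is Competition) are of type object even though the .csv holds TRUE and FALSE values in them. Column 7 (Food Availability) is also of type object, but it is a categorical text column (Low/Medium/High).
Assumption: some rows are missing values in these three columns, which matches their lower non-null counts.
Number of missing values per column: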

In [9]:
dataset.isnull().sum()

Humidity Level            0
Temperature               0
Time of Day               0
Luring Success            0
Time Spent Hunting        0
Prey Size                 0
Ambush                    0
Food Availability     30212
Predator Presence     29708
Is Competition        30119
Hunting Success           0
dtype: int64
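
The absolute counts are easier to judge as a share of all rows. Below is a minimal sketch on a small made-up frame; the `toy` data and the name `missing_share` are assumptions for illustration, not part of the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    'Food Availability': ['Low', np.nan, 'High', 'Medium'],
    'Predator Presence': [True, False, np.nan, np.nan],
    'Hunting Success': [0, 1, 1, 1],
})

# isnull().mean() gives the fraction of missing values per column
missing_share = (toy.isnull().mean() * 100).round(2)
print(missing_share)
```

Applied to the real data, the same expression would give roughly 10% for each affected column (for example 30212 / 300000 ≈ 10.07% for Food Availability).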

Distribution of rows by number of missing values:

In [10]:
dataset.isnull().sum(axis=1).value_counts()

0    218592
1     73107
2      7971
3       330
Name: count, dtype: int64

Percentage of rows by number of missing values (table)

In [11]:
raspodela = dataset.isnull().sum(axis=1).value_counts().sort_index()
ukupan_broj_redova = len(dataset)

tabela = pd.DataFrame({
    'Missing values per row': raspodela.index,
    'Number of rows': raspodela.values,
    'Percentage (%)': (raspodela.values / ukupan_broj_redova * 100).round(2)
})

tabela


Unnamed: 0,Missing values per row,Number of rows,Percentage (%)
0,0,218592,72.86
1,1,73107,24.37
2,2,7971,2.66
3,3,330,0.11


Observation: most rows with missing values lack only a single value.
Possible approach: rows missing more than one value make up a very small share of the data (about 2.77%, well below 5%), so dropping them will not significantly affect the outcome.
Rows missing exactly one value make up a large part of the dataset (24.37%) and should be kept for the analysis. The missing values can be replaced with valid ones in the following ways:
1. Writing FALSE into every empty field. This can distort the TRUE/FALSE ratio.
2. Probabilistic imputation. The observed TRUE/FALSE ratio is computed and the gaps are filled so that this ratio is maintained.
3. Model-based imputation. A model is trained to predict the missing values.

Option 2 was chosen.
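
To illustrate why option 1 was rejected, here is a small sketch on a hypothetical column (50/50 TRUE/FALSE with 10% of the entries missing). The values, the seed, and the variable names are assumptions made for the example only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

# Hypothetical column: 45 TRUE, 45 FALSE, 10 missing
col = pd.Series([1.0] * 45 + [0.0] * 45 + [np.nan] * 10)

# Option 1: every gap becomes FALSE (0)
filled_false = col.fillna(0)

# Option 2: each gap is drawn as TRUE with the observed TRUE share
p_true = col.mean()  # NaN is skipped, so this is 0.5
draws = (rng.random(int(col.isna().sum())) < p_true).astype(float)
filled_prob = col.copy()
filled_prob[col.isna()] = draws

print(filled_false.mean())  # 0.45
print(filled_prob.mean())
```

Filling with FALSE pulls the TRUE share from 0.50 down to 0.45, while the sampled fill stays near the observed share on average.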

Dropping rows with more than one missing value:

In [12]:
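# Keep rows with at most one missing value; .copy() avoids SettingWithCopyWarning in later assignments
dataset = dataset[dataset.isnull().sum(axis=1) <= 1].copy()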
dataset.info()
dataset.head(5)

<class 'pandas.core.frame.DataFrame'>
Index: 291699 entries, 0 to 299999
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Humidity Level      291699 non-null  float64
 1   Temperature         291699 non-null  float64
 2   Time of Day         291699 non-null  object 
 3   Luring Success      291699 non-null  bool   
 4   Time Spent Hunting  291699 non-null  float64
 5   Prey Size           291699 non-null  float64
 6   Ambush              291699 non-null  bool   
 7   Food Availability   267110 non-null  object 
 8   Predator Presence   267668 non-null  object 
 9   Is Competition      267212 non-null  object 
 10  Hunting Success     291699 non-null  int64  
dtypes: bool(2), float64(4), int64(1), object(4)
memory usage: 22.8+ MB


Unnamed: 0,Humidity Level,Temperature,Time of Day,Luring Success,Time Spent Hunting,Prey Size,Ambush,Food Availability,Predator Presence,Is Competition,Hunting Success
0,0.32,31.25,Morning,True,0.81,6.6,False,,True,False,0
1,0.66,29.08,Afternoon,False,2.81,5.38,False,Low,False,False,1
2,0.61,25.74,Evening,True,3.79,15.189246,True,Low,False,False,1
3,0.37,22.68,Morning,False,4.27,5.06,True,Low,True,False,1
4,0.65,29.2,Afternoon,False,3.84,2.71,False,High,False,False,1


Percentage split of values in the True/False columns

In [13]:
boolean_kolone = []
for kolona in dataset.columns:
    jedinstvene_vrednosti = dataset[kolona].dropna().unique()
    if set(jedinstvene_vrednosti).issubset({0, 1}):
        boolean_kolone.append(kolona)

rezultati = []
for kolona in boolean_kolone:
    ukupno = dataset[kolona].notna().sum()  
    broj_1 = (dataset[kolona] == 1).sum()
    broj_0 = (dataset[kolona] == 0).sum()
    
    proc_1 = (broj_1 / ukupno * 100).round(2) if ukupno > 0 else 0
    proc_0 = (broj_0 / ukupno * 100).round(2) if ukupno > 0 else 0
    
    rezultati.append({
        'Column': kolona,
        'TRUE/1 count': broj_1,
        'FALSE/0 count': broj_0,
        'TRUE/1 share (%)': proc_1,
        'FALSE/0 share (%)': proc_0,
        'Total': ukupno
    })

tabela_boolean = pd.DataFrame(rezultati)
tabela_boolean


Unnamed: 0,Column,TRUE/1 count,FALSE/0 count,TRUE/1 share (%),FALSE/0 share (%),Total
0,Luring Success,145980,145719,50.04,49.96,291699
1,Ambush,145594,146105,49.91,50.09,291699
2,Predator Presence,133661,134007,49.94,50.06,267668
3,Is Competition,133349,133863,49.9,50.1,267212
4,Hunting Success,160315,131384,54.96,45.04,291699
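
Since the goal is to classify Hunting Success, the class shares in the last row give a useful reference point: a model that always predicts the majority class would already be right on about 54.96% of these rows. A small sketch of that arithmetic, using the counts from the table (the variable names are illustrative):

```python
# Counts of Hunting Success taken from the table above
n_success = 160315
n_failure = 131384

total = n_success + n_failure
majority_baseline = max(n_success, n_failure) / total

print(total)                              # 291699
print(round(majority_baseline * 100, 2))  # 54.96
```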


Converting the True/False column values to 1/0.
Boolean attributes are transformed into numeric form (0/1) so that machine learning algorithms that require numeric inputs can be applied.

In [14]:
kolone = ['Predator Presence', 'Is Competition']

for kolona in kolone:
    # Map the text values 'TRUE'/'FALSE' (case-insensitive) to 1/0
    dataset[kolona] = dataset[kolona].replace(r'(?i)^true$', 1, regex=True).replace(r'(?i)^false$', 0, regex=True)
    # Convert to nullable integers so that missing values stay as <NA>
    dataset[kolona] = pd.to_numeric(dataset[kolona], errors='coerce').astype('Int64')

Probabilistic imputation for the columns with missing values

In [15]:
kolone_boolean = ['Predator Presence', 'Is Competition']

for kol in kolone_boolean:
    p_true = dataset[kol].mean()
    
    dataset[kol] = dataset[kol].apply(
        lambda x: 1 if pd.notna(x) and x == 1 
        else (1 if pd.isna(x) and np.random.rand() < p_true else 0)
    )

if dataset['Food Availability'].isna().any():
    distribucija = dataset['Food Availability'].value_counts(normalize=True)
    kategorije = distribucija.index.tolist()
    verovatnoce = distribucija.values.tolist()
    
    nan_indeksi = dataset['Food Availability'].isna()
    
    dataset.loc[nan_indeksi, 'Food Availability'] = np.random.choice(
        kategorije, 
        size=nan_indeksi.sum(),
        p=verovatnoce,
    )
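
The cell above draws from `np.random` without a seed, so each run fills the gaps differently. Below is a minimal sketch of the same idea in a reproducible, vectorized form. The helper `impute_by_distribution`, the seed value, and the toy series are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def impute_by_distribution(series, rng):
    """Fill missing entries by sampling from the observed value frequencies."""
    missing = series.isna()
    if not missing.any():
        return series.copy()
    # value_counts ignores missing values by default
    freq = series.value_counts(normalize=True)
    draws = rng.choice(
        freq.index.to_numpy(),
        size=int(missing.sum()),
        p=freq.to_numpy(dtype=float),
    )
    result = series.copy()
    result[missing] = draws
    return result

rng = np.random.default_rng(42)  # seed is an arbitrary choice
toy = pd.Series(['Low', None, 'High', 'Low', None, 'Medium'], dtype=object)
filled = impute_by_distribution(toy, rng)
print(filled.tolist())
```

Observed values are left untouched; only the gaps are sampled, so the category shares are preserved in expectation.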

The Time of Day and Food Availability columns are categorical, so one-hot encoding can be applied to them.

In [16]:
time_of_day_one_hot = pd.get_dummies(dataset['Time of Day'], prefix='Time of Day', dummy_na=False)
food_availability_one_hot = pd.get_dummies(dataset['Food Availability'], prefix='Food Availability', dummy_na=False)

dataset = dataset.drop(columns=['Time of Day', 'Food Availability'])

dataset = pd.concat([dataset, time_of_day_one_hot, food_availability_one_hot], axis=1)

ostale_kolone = [col for col in dataset.columns if col not in list(time_of_day_one_hot.columns) + list(food_availability_one_hot.columns)]
hunting_success_index = ostale_kolone.index('Hunting Success')

nove_kolone = ostale_kolone[:hunting_success_index] + list(time_of_day_one_hot.columns) + list(food_availability_one_hot.columns) + ['Hunting Success'] + ostale_kolone[hunting_success_index + 1:]

dataset = dataset[nove_kolone]

dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 291699 entries, 0 to 299999
Data columns (total 15 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Humidity Level            291699 non-null  float64
 1   Temperature               291699 non-null  float64
 2   Luring Success            291699 non-null  bool   
 3   Time Spent Hunting        291699 non-null  float64
 4   Prey Size                 291699 non-null  float64
 5   Ambush                    291699 non-null  bool   
 6   Predator Presence         291699 non-null  int64  
 7   Is Competition            291699 non-null  int64  
 8   Time of Day_Afternoon     291699 non-null  bool   
 9   Time of Day_Evening       291699 non-null  bool   
 10  Time of Day_Morning       291699 non-null  bool   
 11  Food Availability_High    291699 non-null  bool   
 12  Food Availability_Low     291699 non-null  bool   
 13  Food Availability_Medium  291699 non-null  bool  
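
The encode-drop-concat-reorder sequence can also be written more compactly. The sketch below uses the `columns=` parameter of `pd.get_dummies` on a made-up three-row frame; the toy values are assumptions, and this is an alternative to the cell above rather than a description of it.

```python
import pandas as pd

toy = pd.DataFrame({
    'Time of Day': ['Morning', 'Evening', 'Afternoon'],
    'Prey Size': [6.6, 15.19, 5.38],
    'Hunting Success': [0, 1, 1],
})

# With columns=, get_dummies drops the originals and appends the dummy columns
encoded = pd.get_dummies(toy, columns=['Time of Day'])

# Move the target to the end so that the features come first
target = 'Hunting Success'
encoded = encoded[[c for c in encoded.columns if c != target] + [target]]

print(encoded.columns.tolist())
```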

Converting bool TRUE/FALSE values to numeric 0/1 (int64)

In [17]:
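# Cast only the bool columns; a frame-wide replace({True: 1, False: 0}) triggers a pandas FutureWarning
bool_cols = dataset.select_dtypes(include='bool').columns
dataset = dataset.astype({col: 'int64' for col in bool_cols})

dataset.info()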


<class 'pandas.core.frame.DataFrame'>
Index: 291699 entries, 0 to 299999
Data columns (total 15 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Humidity Level            291699 non-null  float64
 1   Temperature               291699 non-null  float64
 2   Luring Success            291699 non-null  int64  
 3   Time Spent Hunting        291699 non-null  float64
 4   Prey Size                 291699 non-null  float64
 5   Ambush                    291699 non-null  int64  
 6   Predator Presence         291699 non-null  int64  
 7   Is Competition            291699 non-null  int64  
 8   Time of Day_Afternoon     291699 non-null  int64  
 9   Time of Day_Evening       291699 non-null  int64  
 10  Time of Day_Morning       291699 non-null  int64  
 11  Food Availability_High    291699 non-null  int64  
 12  Food Availability_Low     291699 non-null  int64  
 13  Food Availability_Medium  291699 non-null  int64 

<b style="background-color: #470047">Rounding to two decimal places</b>

In [28]:
num_cols = dataset.select_dtypes(include=['float64']).columns

for col in num_cols:
    dataset[col] = dataset[col].round(2)

dataset.head(5)

Unnamed: 0,Humidity Level,Temperature,Luring Success,Time Spent Hunting,Prey Size,Ambush,Predator Presence,Is Competition,Time of Day_Afternoon,Time of Day_Evening,Time of Day_Morning,Food Availability_High,Food Availability_Low,Food Availability_Medium,Hunting Success
0,0.32,31.25,1,0.81,6.6,0,1,0,0,0,1,1,0,0,0
1,0.66,29.08,0,2.81,5.38,0,0,0,1,0,0,0,1,0,1
2,0.61,25.74,1,3.79,15.19,1,0,0,0,1,0,0,1,0,1
3,0.37,22.68,0,4.27,5.06,1,1,0,0,0,1,0,1,0,1
4,0.65,29.2,0,3.84,2.71,0,0,0,1,0,0,1,0,0,1


Saving the processed data to CSV

In [29]:
dataset.to_csv('../data/data_processed.csv', index=False)
print('Data saved to ../data/data_processed.csv')

Data saved to ../data/data_processed.csv
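
A quick round-trip check can confirm that the saved file contains no gaps and only numeric columns. This is a sketch under assumptions: it writes a small made-up frame to a temporary directory instead of touching the real `../data` folder, and the names `processed`, `no_missing`, and `all_numeric` are illustrative.

```python
import os
import tempfile

import pandas as pd

# Toy frame in the shape of the processed data (illustrative values)
processed = pd.DataFrame({
    'Humidity Level': [0.32, 0.66],
    'Luring Success': [1, 0],
    'Hunting Success': [0, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data_processed.csv')  # hypothetical location
    processed.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# After preprocessing, no gaps and no non-numeric columns should remain
no_missing = not reloaded.isnull().any().any()
all_numeric = reloaded.select_dtypes(exclude='number').empty
print(no_missing, all_numeric)
```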
