# Exploratory analysis

The difficulty with the dataset in this task is that only the train set is available; no validation or test set is provided. The train set contains 8000 rows (examples), so I split the data as follows: 6400 rows for the train set, 800 for the dev set and 800 for the test set.

In [3]:
import pandas as pd

### Full dataset

In [4]:
data = pd.read_csv('data/dataset.csv')
data

Unnamed: 0,id,text,is_humor,humor_rating,humor_controversy,offense_rating
0,1,TENNESSEE: We're the best state. Nobody even c...,1,2.42,1.0,0.20
1,2,A man inserted an advertisement in the classif...,1,2.50,1.0,1.10
2,3,How many men does it take to open a can of bee...,1,1.95,0.0,2.40
3,4,Told my mom I hit 1200 Twitter followers. She ...,1,2.11,1.0,0.00
4,5,Roses are dead. Love is fake. Weddings are bas...,1,2.78,0.0,0.10
...,...,...,...,...,...,...
7995,7996,Lack of awareness of the pervasiveness of raci...,0,,,0.25
7996,7997,Why are aspirins white? Because they work sorry,1,1.33,0.0,3.85
7997,7998,"Today, we Americans celebrate our independence...",1,2.55,0.0,0.00
7998,7999,How to keep the flies off the bride at an Ital...,1,1.00,0.0,3.00


Meaning of the columns:

- id - The identification number of each sentence. It can be used to uniquely identify every item in the dataset.
- text - The sentences to be analysed.
- is_humor - Binary label (0 or 1) that indicates whether the sentence is humorous. A value of 1 means the sentence is labelled as humorous, 0 means it is not.
- humor_rating - Numeric rating (1-5) that represents the annotators' subjective perception of how funny the sentence is. The annotators rated the funniness of the sentence on a scale from 1 to 5.
- humor_controversy - Binary label (0 or 1) that indicates whether the humor in the sentence is controversial. A value of 1 means that the humor rating for that sentence is controversial.
- offense_rating - Numeric rating (1-5) that represents the annotators' subjective perception of how offensive the sentence is. The annotators rated the offensiveness of the sentence on a scale from 1 to 5. Not giving a rating is treated as 0 here, so values below 1 can occur.
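The column descriptions above can be turned into a few automatic checks. The following is a minimal sketch, assuming a small hand-made frame: the rows in `sample` and the helper `check_schema` are hypothetical and only mirror the column layout, they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows that follow the column layout described above
# (values are made up for illustration).
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["a joke", "a plain sentence", "another joke"],
    "is_humor": [1, 0, 1],
    "humor_rating": [2.4, None, 1.3],
    "humor_controversy": [1.0, None, 0.0],
    "offense_rating": [0.2, 0.0, 3.85],
})


def check_schema(df: pd.DataFrame) -> dict:
    """Return named boolean checks for the described columns."""
    return {
        "id_unique": bool(df["id"].is_unique),
        "is_humor_binary": bool(df["is_humor"].isin([0, 1]).all()),
        # NaN is allowed in this column, so drop it before testing membership
        "controversy_binary": bool(df["humor_controversy"].dropna().isin([0, 1]).all()),
        "offense_non_negative": bool((df["offense_rating"] >= 0).all()),
    }


checks = check_schema(sample)
print(checks)
```

On the real data the same helper could be called as `check_schema(data)`; any check that comes back `False` would point to a row worth inspecting.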

In [5]:
print(data.describe())
print()
print()
print(f"Number of humorous texts: {len(data[data['is_humor'] == 1])}")
print(f"Number of non-humorous texts: {len(data[data['is_humor'] == 0])}")
# Report the missing values per column, labelled so the counts can be told apart
print(f"NaN count in is_humor: {len(data[data['is_humor'].isna()])}")
print(f"NaN count in humor_rating: {len(data[data['humor_rating'].isna()])}")
print(f"NaN count in humor_controversy: {len(data[data['humor_controversy'].isna()])}")
print(f"NaN count in offense_rating: {len(data[data['offense_rating'].isna()])}")

               id     is_humor  humor_rating  humor_controversy  \
count  8000.00000  8000.000000   4932.000000        4932.000000   
mean   4000.50000     0.616500      2.260525           0.499797   
std    2309.54541     0.486269      0.566974           0.500051   
min       1.00000     0.000000      0.100000           0.000000   
25%    2000.75000     0.000000      1.890000           0.000000   
50%    4000.50000     1.000000      2.280000           0.000000   
75%    6000.25000     1.000000      2.650000           1.000000   
max    8000.00000     1.000000      4.000000           1.000000   

       offense_rating  
count     8000.000000  
mean         0.585325  
std          0.979955  
min          0.000000  
25%          0.000000  
50%          0.100000  
75%          0.700000  
max          4.850000  


Number of humorous texts: 4932
Number of non-humorous texts: 3068
NaN count in is_humor: 0
NaN count in humor_rating: 3068
NaN count in humor_controversy: 3068
NaN count in offense_rating: 0
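The four separate NaN counts can also be obtained in one call. A minimal sketch on a hypothetical four-row frame (`toy` is invented for illustration), using `DataFrame.isna().sum()`, which returns one count per column:

```python
import pandas as pd

# Hypothetical frame with the same missing-value pattern as the dataset:
# ratings are absent for non-humorous rows.
toy = pd.DataFrame({
    "is_humor": [1, 0, 1, 0],
    "humor_rating": [2.4, None, 1.3, None],
    "humor_controversy": [1.0, None, 0.0, None],
    "offense_rating": [0.2, 0.0, 3.85, 0.25],
})

# One NaN count per column, indexed by column name
nan_counts = toy.isna().sum()
print(nan_counts)
```

Because the result is indexed by column name, each count is labelled automatically.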


In [6]:
# Drop missing entries from the 'text' column; if nothing is dropped, every row has text
text_column_not_null = data['text'].dropna()

# Print the length of the resulting Series
print(f"Number of rows without NaN in the 'text' column: {len(text_column_not_null)}")

Number of rows without NaN in the 'text' column: 8000


In [7]:
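# Share of controversial humor over all rows.
# sum() skips NaN, so the numerator counts only humorous texts flagged as controversial,
# while the denominator also includes the non-humorous texts.
controversial_count = data['humor_controversy'].sum()
total_samples = len(data)

print(f"Share of texts with controversial humor (all rows): {controversial_count / total_samples * 100:.2f}%")

Share of texts with controversial humor (all rows): 30.81%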
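The 30.81% figure uses all 8000 rows as the denominator, although `humor_controversy` is only defined for the 4932 humorous texts. The arithmetic below is a small sketch that recovers the controversial count from the `describe()` output (mean 0.499797 over 4932 rows) and compares both denominators; the variable names are my own.

```python
total_rows = 8000   # all texts
humor_rows = 4932   # texts with is_humor == 1

# mean of a 0/1 column times its row count gives the number of ones
controversial = round(0.499797 * humor_rows)

share_all = controversial / total_rows * 100     # denominator: every text
share_humor = controversial / humor_rows * 100   # denominator: humorous texts only

print(f"Controversial texts: {controversial}")
print(f"Share over all texts: {share_all:.2f}%")
print(f"Share over humorous texts: {share_humor:.2f}%")
```

Over humorous texts only, the share comes out close to one half, so the two controversy classes are roughly balanced within the subset where the label exists.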


In [8]:
# Sentence length analysis (number of whitespace-separated tokens per text)
data['sentence_length'] = data['text'].apply(lambda x: len(x.split()))
print(data[['text', 'sentence_length']].head())
print(data[['sentence_length']].mean())
print(data.groupby('is_humor')['sentence_length'].mean())

                                                text  sentence_length
0  TENNESSEE: We're the best state. Nobody even c...               17
1  A man inserted an advertisement in the classif...               32
2  How many men does it take to open a can of bee...               26
3  Told my mom I hit 1200 Twitter followers. She ...               26
4  Roses are dead. Love is fake. Weddings are bas...               12
sentence_length    20.889375
dtype: float64
is_humor
0    21.932855
1    20.240268
Name: sentence_length, dtype: float64
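It is worth keeping in mind what `len(x.split())` measures: the number of whitespace-separated tokens, with punctuation left attached to the neighbouring word. A short sketch using the one example text that is shown in full in the table above (`token_count` is a helper name I introduce here):

```python
def token_count(text: str) -> int:
    """Count whitespace-separated tokens, as in the sentence_length column."""
    return len(text.split())


text = "Why are aspirins white? Because they work sorry"
tokens = text.split()

print(tokens)             # punctuation stays glued to the word, e.g. 'white?'
print(token_count(text))
```

A tokenizer that separates punctuation would report higher counts, so these lengths are comparable only with other whitespace-based counts.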


The training set contains no invalid examples. Where is_humor == 0, i.e. for texts that are not humorous, humor_rating and humor_controversy have no value, because they cannot be computed for such texts.
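Matching totals (3068 non-humorous rows and 3068 NaN values) suggest this, but do not prove that the missing values sit on exactly the non-humorous rows. A row-wise check could look like the sketch below; `toy` is a hypothetical stand-in for `data`.

```python
import pandas as pd

# Hypothetical stand-in for the dataset
toy = pd.DataFrame({
    "is_humor": [1, 0, 1, 0],
    "humor_rating": [2.4, None, 1.3, None],
    "humor_controversy": [1.0, None, 0.0, None],
})

non_humor = toy["is_humor"] == 0

# For every row, "value is missing" must coincide with "text is not humorous"
rating_ok = (toy["humor_rating"].isna() == non_humor).all()
controversy_ok = (toy["humor_controversy"].isna() == non_humor).all()

consistent = bool(rating_ok and controversy_ok)
print(f"Missing values match non-humorous rows: {consistent}")
```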

### Splitting the dataset

In [9]:
from sklearn.model_selection import StratifiedShuffleSplit

# Split 'data' 80/10/10 into train, dev and test sets,
# stratified on the 'is_humor' label so each set keeps the same class ratio

# Create an instance of StratifiedShuffleSplit for splitting into train and temp sets
stratified_splitter_train_temp = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# Use the splitter to generate indices for train and temp sets
for train_index, temp_index in stratified_splitter_train_temp.split(data, data['is_humor']):
    train_data, temp_data = data.iloc[train_index], data.iloc[temp_index]

# Create an instance of StratifiedShuffleSplit for further splitting temp into dev and test sets
stratified_splitter_temp_dev_test = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)

# Use the splitter to generate indices for dev and test sets
for dev_index, test_index in stratified_splitter_temp_dev_test.split(temp_data, temp_data['is_humor']):
    dev_data, test_data = temp_data.iloc[dev_index], temp_data.iloc[test_index]

# Print the sizes of the obtained sets
print(f"Size of train set: {len(train_data)}")
print(f"Size of dev set: {len(dev_data)}")
print(f"Size of test set: {len(test_data)}")


Size of train set: 6400
Size of dev set: 800
Size of test set: 800
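The same 80/10/10 stratified split can be written more compactly with `train_test_split` and its `stratify` argument. This is an alternative sketch, not the method used above, and it will generally not select the same rows as `StratifiedShuffleSplit`. The frame `toy` is synthetic: it only reproduces the reported size and class counts (4932 humorous, 3068 non-humorous).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same size and class counts as the dataset
toy = pd.DataFrame({
    "id": range(1, 8001),
    "is_humor": [1] * 4932 + [0] * 3068,
})

# Step 1: hold out 20% of the rows, keeping the class ratio
train_part, temp_part = train_test_split(
    toy, test_size=0.2, stratify=toy["is_humor"], random_state=42
)

# Step 2: split the held-out 20% in half into dev and test
dev_part, test_part = train_test_split(
    temp_part, test_size=0.5, stratify=temp_part["is_humor"], random_state=42
)

print(len(train_part), len(dev_part), len(test_part))
print(f"Humorous share in train: {train_part['is_humor'].mean() * 100:.2f}%")
```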


In [10]:
# Save the train set to a CSV file
train_data.to_csv('data/train.csv', index=False)

# Save the dev set to a CSV file
dev_data.to_csv('data/dev.csv', index=False)

# Save the test set to a CSV file
test_data.to_csv('data/test.csv', index=False)
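Before relying on the saved files, it may be useful to confirm that the three sets do not share any example and together cover the full dataset. The helper below is a sketch working on plain collections of ids; `check_partition` is a name I introduce, and the ids in the usage lines are made up.

```python
def check_partition(all_ids, *splits):
    """Return True if the splits are pairwise disjoint and together cover all_ids."""
    seen = set()
    for split in splits:
        ids = set(split)
        if seen & ids:       # an id already appeared in an earlier split
            return False
        seen |= ids
    return seen == set(all_ids)


# Made-up ids for illustration
print(check_partition(range(1, 11), [1, 2, 3, 4, 5, 6, 7, 8], [9], [10]))
print(check_partition(range(1, 11), [1, 2, 3, 4, 5, 6, 7, 8], [8, 9], [10]))
```

With the frames from this notebook the call would be `check_partition(data['id'], train_data['id'], dev_data['id'], test_data['id'])`.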

### Ratio of humorous to non-humorous texts in the train and dev sets

In [11]:
# Preview the train set; this cell must run after the split cell above, in the same kernel session
train_data

In [12]:
# Percentage of humorous texts in the train set
humor_percent = len(train_data[train_data['is_humor'] == 1]) / len(train_data) * 100

# Percentage of non-humorous texts in the train set
non_humor_percent = len(train_data[train_data['is_humor'] == 0]) / len(train_data) * 100

# Print the results with two decimal places
print(f"Percentage of humorous texts in the train set: {humor_percent:.2f}%")
print(f"Percentage of non-humorous texts in the train set: {non_humor_percent:.2f}%")

Percentage of humorous texts in the train set: 61.66%
Percentage of non-humorous texts in the train set: 38.34%


In [13]:
dev_data

Unnamed: 0,id,text,is_humor,humor_rating,humor_controversy,offense_rating,sentence_length
1864,1865,Me: What are my chances doc? Doctor: The surge...,1,2.40,1.0,0.00,32
7235,7236,Why do fish live in salt water? Because pepper...,1,2.60,1.0,0.00,13
2687,2688,"Family, we appreciate your patience. Due to fu...",0,,,0.00,46
1454,1455,John F. Kennedy's brain has been missing for 5...,1,1.27,0.0,1.55,10
7830,7831,"""Blueberry juice boosts memory""",0,,,0.05,4
...,...,...,...,...,...,...,...
1803,1804,On a daily basis some young gay guys get HIV t...,0,,,0.00,51
4749,4750,"just had a redbull, feelin' good, energetic, m...",0,,,0.00,22
2140,2141,We would like to remind you that registration ...,0,,,0.20,22
4140,4141,I'm a big fan of people being exactly who they...,0,,,0.00,16


In [14]:
# Percentage of humorous texts in the dev set
humor_percent = len(dev_data[dev_data['is_humor'] == 1]) / len(dev_data) * 100

# Percentage of non-humorous texts in the dev set
non_humor_percent = len(dev_data[dev_data['is_humor'] == 0]) / len(dev_data) * 100

# Print the results with two decimal places
print(f"Percentage of humorous texts in the dev set: {humor_percent:.2f}%")
print(f"Percentage of non-humorous texts in the dev set: {non_humor_percent:.2f}%")

Percentage of humorous texts in the dev set: 61.62%
Percentage of non-humorous texts in the dev set: 38.38%
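The train and dev cells repeat the same computation, and the test set is not checked at all. A small helper applied to every split would cover all three. The sketch below uses tiny made-up frames; `humor_share` and `splits` are names I introduce. Since `is_humor` is a 0/1 column, its mean is the humorous fraction.

```python
import pandas as pd


def humor_share(df: pd.DataFrame, label: str = "is_humor") -> float:
    """Percentage of rows whose binary label equals 1."""
    return float(df[label].mean() * 100)


# Made-up frames standing in for train_data, dev_data and test_data
splits = {
    "train": pd.DataFrame({"is_humor": [1, 1, 1, 0, 0]}),
    "dev": pd.DataFrame({"is_humor": [1, 1, 0]}),
    "test": pd.DataFrame({"is_humor": [1, 0, 1]}),
}

for name, frame in splits.items():
    share = humor_share(frame)
    print(f"{name}: {share:.2f}% humorous, {100 - share:.2f}% non-humorous")
```

In the notebook the dictionary would instead hold `train_data`, `dev_data` and `test_data`.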
