# Eksploratorn analiza

Problem kod dataseta u ovom zadatku je što imamo pristup samo train set-u, nemamo validacijskom i test setu. Broj redova, primjera u train setu je 8000 te sam podijelio podatke na sljedeći način: 6400 train set, 800 dev set i 800 test set.

In [1]:
import pandas as pd

### Puni dataset

In [2]:
data = pd.read_csv('data/dataset.csv')
data

Unnamed: 0,id,text,is_humor,humor_rating,humor_controversy,offense_rating
0,1,TENNESSEE: We're the best state. Nobody even c...,1,2.42,1.0,0.20
1,2,A man inserted an advertisement in the classif...,1,2.50,1.0,1.10
2,3,How many men does it take to open a can of bee...,1,1.95,0.0,2.40
3,4,Told my mom I hit 1200 Twitter followers. She ...,1,2.11,1.0,0.00
4,5,Roses are dead. Love is fake. Weddings are bas...,1,2.78,0.0,0.10
...,...,...,...,...,...,...
7995,7996,Lack of awareness of the pervasiveness of raci...,0,,,0.25
7996,7997,Why are aspirins white? Because they work sorry,1,1.33,0.0,3.85
7997,7998,"Today, we Americans celebrate our independence...",1,2.55,0.0,0.00
7998,7999,How to keep the flies off the bride at an Ital...,1,1.00,0.0,3.00


Značenje stupaca:

    - id - Ovo je identifikacijski broj za svaku recenicu. Može se koristiti za jedinstveno identificiranje svake stavke u skupu podataka.
    - text - Ovaj stupac sadrži rečenice koje je potrebno analizirati.
    - is_humor - inarna oznaka (0 ili 1) koja označava ima li rečenica humor ili ne. Ako je vrijednost 1, rečenica je označena kao humoristična, ako je 0, rečenica nije .
    - humor_rating - Numerička ocjena (1-5) koja predstavlja subjektivnu percepciju anotatora o tome koliko je rečenica smiješna. Anotatori su ocijenili smiješnost rečenice na skali od 1 do 5.
    - humor_controversy - Binarna oznaka (0 ili 1) koja označava ima li kontroverzu humora u rečenici. Ako je vrijednost 1, to znači da je ocjena humora za tu rečenicu kontroverzna.
    - offense_rating - Numerička ocjena (1-5) koja predstavlja subjektivnu percepciju anotatora o tome koliko je rečenica uvredljiva. Anotatori su ocijenili razinu uvredljivosti rečenice na skali od 1 do 5. Ovdje se također razmatra da nedavanje ocjene jednako 0.

In [3]:
print(data.describe())
print()
print()
print(f"Broj humoristicnih tekstova: {len(data[data['is_humor'] == 1])}")
print(f"Broj ne humoristicnih: {len(data[data['is_humor'] == 0])}")
print(f"Broj NaN zapisa: {len(data[data['is_humor'].isna()])}")
print(f"Broj NaN zapisa: {len(data[data['humor_rating'].isna()])}")
print(f"Broj NaN zapisa: {len(data[data['humor_controversy'].isna()])}")
print(f"Broj NaN zapisa: {len(data[data['offense_rating'].isna()])}")

               id     is_humor  humor_rating  humor_controversy  \
count  8000.00000  8000.000000   4932.000000        4932.000000   
mean   4000.50000     0.616500      2.260525           0.499797   
std    2309.54541     0.486269      0.566974           0.500051   
min       1.00000     0.000000      0.100000           0.000000   
25%    2000.75000     0.000000      1.890000           0.000000   
50%    4000.50000     1.000000      2.280000           0.000000   
75%    6000.25000     1.000000      2.650000           1.000000   
max    8000.00000     1.000000      4.000000           1.000000   

       offense_rating  
count     8000.000000  
mean         0.585325  
std          0.979955  
min          0.000000  
25%          0.000000  
50%          0.100000  
75%          0.700000  
max          4.850000  


Broj humoristicnih tekstova: 4932
Broj ne humoristicnih: 3068
Broj NaN zapisa: 0
Broj NaN zapisa: 3068
Broj NaN zapisa: 3068
Broj NaN zapisa: 0


In [4]:
# Provjerava ima li unos u svakom redu za 'text' stupac
text_column_not_null = data['text'].dropna()

# Ispisuje duljinu rezultirajućeg DataFrame-a
print(f"Broj redaka bez NaN vrijednosti u 'text' stupcu: {len(text_column_not_null)}")

Broj redaka bez NaN vrijednosti u 'text' stupcu: 8000


In [5]:
# Udio kontroverznosti humora
controversial_count = data['humor_controversy'].sum()
total_samples = len(data)

print(f"Udio kontroverznosti humora: {controversial_count / total_samples * 100:.2f}%")

Udio kontroverznosti humora: 30.81%


In [6]:
# Analiza duljine rečenica
data['sentence_length'] = data['text'].apply(lambda x: len(x.split()))
print(data[['text', 'sentence_length']].head())
print(data[['sentence_length']].mean())
print(data.groupby('is_humor')['sentence_length'].mean())

                                                text  sentence_length
0  TENNESSEE: We're the best state. Nobody even c...               17
1  A man inserted an advertisement in the classif...               32
2  How many men does it take to open a can of bee...               26
3  Told my mom I hit 1200 Twitter followers. She ...               26
4  Roses are dead. Love is fake. Weddings are bas...               12
sentence_length    20.889375
dtype: float64
is_humor
0    21.932855
1    20.240268
Name: sentence_length, dtype: float64


Training set ne sadrži neispravne primjere. Gdje su vrijednosti is_humor == 0, tj. za tekstove koji nisu humoristični nema vrijednosti humor_rating	i humor_controversy jer to za njih niti nije moguće izračunati.

### Podjela dataset-a

In [7]:
from sklearn.model_selection import train_test_split

# Podijeli train set na train i privremeni set (ostatak)
train_data, temp_data = train_test_split(data, test_size=0.2, random_state=42)

# Podijeli privremeni set na dev i test set
dev_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)

# Ispisuje veličine dobivenih setova
print(f"Veličina train seta: {len(train_data)}")
print(f"Veličina dev seta: {len(dev_data)}")
print(f"Veličina test seta: {len(test_data)}")

Veličina train seta: 6400
Veličina dev seta: 800
Veličina test seta: 800


In [8]:
# Spremi train set u CSV file
train_data.to_csv('data/train.csv', index=False)

# Spremi dev set u CSV file
dev_data.to_csv('data/dev.csv', index=False)

# Spremi test set u CSV file
test_data.to_csv('data/test.csv', index=False)

### Odnos humorističnih i nehumorističnih tekstova u train i dev setu

In [9]:
train_data


Unnamed: 0,id,text,is_humor,humor_rating,humor_controversy,offense_rating,sentence_length
1467,1468,customer: i'd like to return this boomerang me...,1,2.30,1.0,0.00,14
5768,5769,"Keep your ears ready for next week, when The A...",0,,,0.50,18
5714,5715,[2 am] *5 year old sneaks into my room* 5: (wh...,1,3.00,0.0,0.00,36
1578,1579,Sex and food activate the same parts of the br...,0,,,0.10,10
6958,6959,"Gay or straight, No state should legally recog...",1,2.25,1.0,0.30,18
...,...,...,...,...,...,...,...
5226,5227,What is a pirates favorite letter? P. Because ...,1,3.00,1.0,0.15,13
5390,5391,My parents just said they want another child. ...,1,2.60,1.0,0.15,21
860,861,Don't depend too much on anyone in this world ...,0,,,0.00,21
7603,7604,"When my toddler gets upset, he does this stomp...",1,2.55,0.0,0.10,33


In [10]:
# Broj humorističnih tekstova u train setu
humor_percent = len(train_data[train_data['is_humor'] == 1]) / len(train_data) * 100

# Broj nehumorističnih tekstova u train setu
non_humor_percent = len(train_data[train_data['is_humor'] == 0]) / len(train_data) * 100

# Ispis rezultata s dvije decimale
print(f"Postotak humorističnih tekstova u train setu: {humor_percent:.2f}%")
print(f"Postotak nehumorističnih tekstova u train setu: {non_humor_percent:.2f}%")


Postotak humorističnih tekstova u train setu: 61.69%
Postotak nehumorističnih tekstova u train setu: 38.31%


In [11]:
dev_data

Unnamed: 0,id,text,is_humor,humor_rating,humor_controversy,offense_rating,sentence_length
1606,1607,Weird situation based questions are great open...,1,1.80,1.0,0.10,33
2094,2095,"Joseph confronts Mary... Joseph: ""Mary, I've h...",1,2.80,1.0,2.05,28
1034,1035,Why do the French like to eat snails so much? ...,1,2.79,0.0,0.25,15
7463,7464,"""When we're told not to touch something we usu...",0,,,0.00,24
5363,5364,Did you hear about the Native American who dra...,1,1.65,0.0,1.80,19
...,...,...,...,...,...,...,...
5551,5552,[gym] Personal Trainer: (looking at my workout...,1,2.22,0.0,0.00,26
6334,6335,I just got a ticket for driving while wearing ...,1,2.25,1.0,0.10,28
5103,5104,If I learned anything from Forest Gump it's th...,1,2.00,0.0,3.35,16
2264,2265,"""Gangsta's Paradise"" has no profanity in it be...",0,,,0.00,21


In [12]:
# Broj humorističnih tekstova u dev setu
humor_percent = len(dev_data[dev_data['is_humor'] == 1]) / len(dev_data) * 100

# Broj nehumorističnih tekstova u dev setu
non_humor_percent = len(dev_data[dev_data['is_humor'] == 0]) / len(dev_data) * 100

# Ispis rezultata s dvije decimale
print(f"Postotak humorističnih tekstova u dev setu: {humor_percent:.2f}%")
print(f"Postotak nehumorističnih tekstova u dev setu: {non_humor_percent:.2f}%")


Postotak humorističnih tekstova u dev setu: 63.38%
Postotak nehumorističnih tekstova u dev setu: 36.62%


## Baseline model

In [42]:
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer

from gensim.models import Word2Vec

from sklearn.metrics import classification_report, accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from sklearn.metrics import f1_score, accuracy_score
import numpy as np

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/gasparko/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/gasparko/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/gasparko/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

#### Pre-process the data

In order to use Word2Vec, you need to pre-process the data. It's very simple: you just need to split sentences to words (tokenization), bring the words to their basic form (lemmatization), and remove some very common words like articles or prepositions (stop-word removal). I'm using RegexpTokenizer, WordNetLemmatizer and NLTK stop word list

In [102]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

In [103]:
def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token.lower() not in stop_words]
    return ' '.join(tokens)

In [104]:
data['processed_text'] = data['text'].apply(preprocess_text)

In [106]:
X_train, X_test, y_train, y_test = train_test_split(
    data['processed_text'], 
    data['is_humor'], 
    test_size=0.2, 
    random_state=42
)

In [107]:
# Word2Vec model training
word2vec_model = Word2Vec(sentences=X_train.apply(word_tokenize), vector_size=100, window=5, min_count=1, workers=4)

In [108]:
# Function to average word vectors for a sentence
def average_word_vectors(words, model, vocabulary, num_features):
    feature_vector = np.zeros((num_features,), dtype="float32")
    n_words = 0
    for word in words:
        if word in vocabulary:
            n_words += 1
            feature_vector = np.add(feature_vector, model.wv[word])
    if n_words:
        feature_vector = np.divide(feature_vector, n_words)
    return feature_vector

In [109]:
# Transform text data to Word2Vec features
def word2vec_features(data, model, num_features):
    vocabulary = set(model.wv.index_to_key)
    return np.vstack([average_word_vectors(tokens, model, vocabulary, num_features) for tokens in data.apply(word_tokenize)])

In [110]:
# SVC model
svc_model = SVC()

In [111]:
# Create a pipeline with Word2Vec and SVC
model_pipeline = Pipeline([
    ('word2vec', FunctionTransformer(lambda x: word2vec_features(x, word2vec_model, 100))),
    ('svc', svc_model)
])

In [112]:
# Train the model
model_pipeline.fit(X_train, y_train)

In [132]:
# Evaluate the model
predictions = model_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
class_report = classification_report(y_test, predictions)
print(f"Classification report: {class_report}")
print(f"Accuracy: {accuracy:.2f}")
print(f"F1 Score: {f1:.2f}")

Classification report:               precision    recall  f1-score   support

           0       0.64      0.12      0.21       616
           1       0.64      0.96      0.76       984

    accuracy                           0.64      1600
   macro avg       0.64      0.54      0.49      1600
weighted avg       0.64      0.64      0.55      1600

Accuracy: 0.64
F1 Score: 0.76


In [114]:
y_test

2215    1
2582    1
1662    1
3027    0
4343    1
       ..
1079    0
7979    1
1115    0
6093    1
6832    1
Name: is_humor, Length: 1600, dtype: int64

In [129]:
count_zeros = (predictions == 0).sum()
count_ones = (predictions == 1).sum()

print(f"Number of zeros: {count_zeros}")
print(f"Number of ones: {count_ones}")

print(f"percentage of humoruous texts: {count_ones*100 / len(y_test)}%")

Number of zeros: 119
Number of ones: 1481
percentage of humoruous texts: 92.5625%


U train setu je cca 65% humoristicnih tekstova, ovdje je prema prvoj predikciji 93%...Potrebno pogledati splitanje train seta da bude jednaka distribucija u train i test setu

In [130]:
# Ulazni tekst
input_text = "I am so funny. Am I?"

In [117]:
# Primijeniti istu predobradu teksta
processed_input = preprocess_text(input_text)

In [118]:
# Pretvoriti tekst u vektor
input_vector = average_word_vectors(processed_input, word2vec_model, set(word2vec_model.wv.index_to_key), 100)

In [119]:
# Provjeriti oblik vektora (provjeriti dimenzionalnost)
print("Shape of input vector:", input_vector.shape)

Shape of input vector: (100,)


In [120]:
# Naparviti reshape sa 1D u 2D
input_vector_2d = input_vector.reshape(1, -1)
print("Shape of input vector (2D):", input_vector_2d.shape)
input_vector_2d

Shape of input vector (2D): (1, 100)


array([[-4.93949771e-01,  6.76344693e-01, -1.43089518e-03,
         1.51576489e-01,  1.34153767e-02, -7.77530134e-01,
         2.42692754e-01,  1.63301551e+00, -7.11848915e-01,
        -4.97749239e-01, -1.67022109e-01, -7.13571548e-01,
        -1.42553017e-01,  3.13758105e-01,  1.96109593e-01,
        -5.53988516e-01,  3.36742133e-01, -6.04906499e-01,
        -2.39247441e-01, -1.13867033e+00,  1.45526618e-01,
         1.09273016e-01,  4.55832571e-01, -2.19770491e-01,
        -9.96512175e-02,  4.93233986e-02, -4.36265826e-01,
        -6.88045844e-03, -8.08161914e-01,  9.45555791e-02,
         5.17455876e-01, -1.77536950e-01,  1.88559160e-01,
        -8.14770877e-01, -1.58952847e-01,  6.27620459e-01,
         3.07435781e-01, -1.95177361e-01, -2.48350546e-01,
        -9.09284890e-01, -2.24475265e-02, -6.44807577e-01,
        -3.83204907e-01, -1.12460785e-01,  2.39399076e-01,
        -3.85139823e-01, -5.32037973e-01,  1.27896741e-01,
         2.72586793e-01,  5.15354335e-01,  1.88405856e-0

In [122]:
# Ovdje dolazi do greske, treba ispraviti model i vidjeti sto je tocno krivo
# prediction = model_pipeline.predict(input_vector_2d)

# print("Predicted class:", prediction)