<b>1. Configuration</b>

We define the column names and remap the sentiment values for simplicity: the positive label 4 becomes 1, so the labels are 0 (negative) and 1 (positive).

In [None]:
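import pandas as pd

# The CSV file has no header row, so the column names are supplied explicitly.
column_names = ['sentiment', 'id', 'date', 'query', 'user', 'text']

df = pd.read_csv('training.1600000.processed.noemoticon.csv',
                 encoding='latin-1',
                 header=None,
                 names=column_names)

print("Original data:")
display(df.head())

# Keep only the label and the tweet text; copy() avoids a SettingWithCopyWarning below.
df = df[['sentiment', 'text']].copy()

# Remap the positive label from 4 to 1, so the labels are 0 and 1.
df['sentiment'] = df['sentiment'].replace(4, 1)

print("\nData after tidying:")
display(df.head())

print("\nSentiment distribution:")
print(df['sentiment'].value_counts())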

Original data:


Unnamed: 0,sentiment,id,date,query,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."



Data after tidying:


Unnamed: 0,sentiment,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."



Sentiment distribution:
sentiment
0    800000
1    800000
Name: count, dtype: int64
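
The loading and remapping step can also be sketched on a tiny in-memory sample. This is only an illustration under stated assumptions: the two CSV rows and the names sample_csv, small_df, and remapped are made up for the example and do not come from the dataset. It shows two optional variations: usecols, which drops the unused columns while the file is parsed, and map(), which turns any label other than 0 or 4 into NaN so that unexpected values are caught instead of passing through silently.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same six-column layout as the real file.
sample_csv = io.StringIO(
    "0,1,Mon Apr 06,NO_QUERY,user_a,feeling sick today\n"
    "4,2,Mon Apr 06,NO_QUERY,user_b,what a great morning\n"
)
column_names = ['sentiment', 'id', 'date', 'query', 'user', 'text']

# usecols discards the unused columns during parsing instead of afterwards.
small_df = pd.read_csv(sample_csv, header=None, names=column_names,
                       usecols=['sentiment', 'text'])

# map() converts any label missing from the dictionary into NaN,
# which makes unexpected values easy to detect.
remapped = small_df['sentiment'].map({0: 0, 4: 1})
assert not remapped.isna().any(), "unexpected sentiment label"
small_df['sentiment'] = remapped.astype(int)

print(small_df)
```

On the full file, reading only two of the six columns should lower peak memory use, although the actual saving was not measured here.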


<b>2. Text cleaning</b>

We define a function for cleaning the text and apply it to the text column. The function ocisti_tekst removes URLs, @-mentions, and punctuation, then converts the text to lowercase.

In [None]:
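import re

def ocisti_tekst(tekst):
    # Remove URLs (http://, https://, or starting with www.).
    tekst = re.sub(r'https?://\S+|www\.\S+', '', tekst)

    # Remove @-mentions of other users.
    tekst = re.sub(r'@\w+', '', tekst)

    # Remove everything that is not a word character or whitespace.
    tekst = re.sub(r'[^\w\s]', '', tekst)

    # Lowercase so that "Good" and "good" map to the same token.
    tekst = tekst.lower()
    return tekst

print("Text BEFORE cleaning:")
display(df.head())

df['text'] = df['text'].apply(ocisti_tekst)

print("\nText AFTER cleaning:")
display(df.head())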

Text BEFORE cleaning:


Unnamed: 0,sentiment,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."



Text AFTER cleaning:


Unnamed: 0,sentiment,text
0,0,awww thats a bummer you shoulda got david ...
1,0,is upset that he cant update his facebook by t...
2,0,i dived many times for the ball managed to sa...
3,0,my whole body feels itchy and like its on fire
4,0,no its not behaving at all im mad why am i he...
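
The effect of each cleaning step is easier to follow on a single tweet. The sketch below repeats the same three substitutions and adds one optional extra that is not part of the notebook: collapsing the runs of whitespace left behind where a mention or URL was removed. The function name clean_text_collapsed and the sample string are hypothetical.

```python
import re

def clean_text_collapsed(text):
    """Same steps as ocisti_tekst, plus whitespace normalization (an added assumption)."""
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'@\w+', '', text)                   # drop @-mentions
    text = re.sub(r'[^\w\s]', '', text)                # drop punctuation
    text = text.lower()
    # Extra step: squeeze repeated whitespace and trim both ends.
    return re.sub(r'\s+', ' ', text).strip()

sample = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer!"
cleaned = clean_text_collapsed(sample)
print(cleaned)  # awww thats a bummer
```

Because \w matches letters, digits, and the underscore, numbers and underscores survive the punctuation step. The default TfidfVectorizer tokenizer splits on whitespace and punctuation anyway, so the extra whitespace step mainly makes the cleaned text easier to read.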


<b>3. Vectorization and transformation</b>

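We split the data into 80% for training and 20% for testing. We then create a TF-IDF vectorizer limited to the 10,000 most frequent terms so that we do not overload the processor. The vectorizer is fitted on the training set only and then applied to the test set.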


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X = df['text']
y = df['sentiment']

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)
print("---")

# Keep only the 10,000 terms with the highest frequency in the training corpus.
vectorizer = TfidfVectorizer(max_features=10000)

# Learn the vocabulary and IDF weights from the training data only.
X_train_tfidf = vectorizer.fit_transform(X_train)

# Reuse the fitted vocabulary for the test data (no refitting).
X_test_tfidf = vectorizer.transform(X_test)

print("Vectorized training set size:", X_train_tfidf.shape)
print("Vectorized test set size:", X_test_tfidf.shape)

Training set size: (1280000,)
Test set size: (320000,)
---
Vectorized training set size: (1280000, 10000)
Vectorized test set size: (320000, 10000)
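
To see what max_features and the fit/transform split do, here is a minimal sketch on a toy corpus. The documents and the variable names (train_docs, test_docs, vec) are invented for the illustration. The vocabulary is capped at four terms, and a word that appears only in the test document is ignored because it was never seen during fitting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; the real notebook uses 1.28 million tweets.
train_docs = ["good morning sunshine", "bad day at work", "good day"]
test_docs = ["good work unknownword"]

# Cap the vocabulary at the 4 most frequent terms of the training corpus.
vec = TfidfVectorizer(max_features=4)

# fit_transform learns the vocabulary and IDF weights, then encodes the training docs.
X_train_small = vec.fit_transform(train_docs)

# transform only encodes; terms outside the learned vocabulary are dropped.
X_test_small = vec.transform(test_docs)

print(X_train_small.shape, X_test_small.shape)
print(sorted(vec.vocabulary_))
```

The number of columns is set by the vocabulary, not by the number of documents, which is why both vectorized sets in the notebook have exactly 10,000 columns.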


<b>4. Training and building the model</b>

We import LogisticRegression, the model we use as the classifier. After training, we run the model on the test data, evaluate it, and display a performance report.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

print("Model training is starting... This may take a minute or two.")

# max_iter=1000 raises the iteration limit above the default of 100.
model = LogisticRegression(max_iter=1000)

model.fit(X_train_tfidf, y_train)

print("Model trained successfully!")
print("---")


print("Evaluation of the model on the test data is starting...")

# Predict labels for the held-out test set.
y_pred = model.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy * 100:.2f}%")
print("---")

# Per-class precision, recall, and F1 score.
print("Detailed classification report:")
print(classification_report(y_test, y_pred))

Model training is starting... This may take a minute or two.
Model trained successfully!
---
Evaluation of the model on the test data is starting...
Model accuracy: 79.48%
---
Detailed classification report:
              precision    recall  f1-score   support

           0       0.80      0.78      0.79    159494
           1       0.79      0.81      0.80    160506

    accuracy                           0.79    320000
   macro avg       0.80      0.79      0.79    320000
weighted avg       0.79      0.79      0.79    320000
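
Once a model like this is trained, classifying new text means applying the same vectorization before calling predict. One way to keep the two steps together is a scikit-learn Pipeline, sketched below on a made-up six-sentence training set. The notebook itself keeps the vectorizer and the model as separate objects, so this is an alternative arrangement, not the notebook's method, and all names and sentences here are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data: 1 = positive, 0 = negative.
train_texts = [
    "i love this", "great happy day", "love great",
    "i hate this", "terrible sad day", "hate terrible",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes raw text and feeds the result to the classifier.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(train_texts, train_labels)

# New text goes through the same fitted vectorizer automatically.
new_texts = ["love happy", "hate sad"]
predictions = pipeline.predict(new_texts)
probabilities = pipeline.predict_proba(new_texts)

print(predictions)
print(probabilities.round(2))
```

With the real model, new tweets would also need to pass through ocisti_tekst first, so that they are cleaned the same way as the training data.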



<b>5. Saving the model and the vectorizer</b>

In [None]:
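import joblib

# Save the fitted vectorizer; it is needed to encode new text the same way.
joblib.dump(vectorizer, 'sentiment_vectorizer.joblib')

# Save the trained classifier.
joblib.dump(model, 'sentiment_model.joblib')

print("The model and the vectorizer were successfully saved to files!")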

The model and the vectorizer were successfully saved to files!
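
Both files are needed later, because the model only understands vectors produced by this exact vectorizer. The following sketch shows a save and reload round trip with joblib.load. It trains a toy model in place so that it runs on its own, and it writes to a temporary directory rather than the working directory; the training sentences and variable names are assumptions made for the example.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the real training set.
train_texts = ["i love this", "great happy day", "i hate this", "terrible sad day"]
train_labels = [1, 1, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_vectorizer.fit_transform(train_texts), train_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    vec_path = os.path.join(tmp_dir, 'sentiment_vectorizer.joblib')
    model_path = os.path.join(tmp_dir, 'sentiment_model.joblib')

    # Persist both objects, then read them back as a fresh session would.
    joblib.dump(toy_vectorizer, vec_path)
    joblib.dump(toy_model, model_path)
    loaded_vectorizer = joblib.load(vec_path)
    loaded_model = joblib.load(model_path)

new_texts = ["love happy", "hate sad"]
original_pred = toy_model.predict(toy_vectorizer.transform(new_texts))
reloaded_pred = loaded_model.predict(loaded_vectorizer.transform(new_texts))

print(list(original_pred), list(reloaded_pred))
```

Files written by joblib are pickle-based, so they should be loaded only from trusted sources and ideally with the same scikit-learn version that produced them.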
