# Sentiment Analysis with Scikit-Learn

### Dataset source
https://ermlab.com/en/blog/nlp/polish-sentiment-analysis-using-keras-and-word2vec/

Google Drive link: https://drive.google.com/file/d/1vXqUEBjUHGGy3vV2dA7LlvBjjZlQnl0D/view

### More materials
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

XGBoost with scikit-learn for tabular data: https://www.kaggle.com/code/stuarthallows/using-xgboost-with-scikit-learn/notebook

To deal with imbalanced data: https://github.com/scikit-learn-contrib/imbalanced-learn

## Read and clean data
More info in data_reading.ipynb notebook :) 

In [None]:
import pandas as pd

file_path = 'polish_sentiment_dataset.csv'

df = pd.read_csv(file_path)

# drop rows rated 0, drop the 'length' column, then drop rows with missing values
df = df[df['rate'] != 0]
df = df.drop('length', axis=1)
df = df.dropna(axis=0)

In [None]:
# check data
df.sample(10)

In [None]:
for x in df[df['rate'] == -1]['description'].sample(5):
    print(x)

In [None]:
# subsample the data, only to speed up model training
df = df.sample(10000, random_state=123)

X = df['description']
y = df['rate']

In [None]:
y.value_counts()
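
If `value_counts()` shows that one class dominates, class weights are one common way to compensate. The sketch below uses made-up toy labels (80 positive, 20 negative), not the real dataset, to show what scikit-learn's "balanced" heuristic computes: `n_samples / (n_classes * class_count)`. The names `toy_y` and `class_weights` are illustrative only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (an assumption for illustration): 80 positive, 20 negative.
toy_y = np.array([1] * 80 + [-1] * 20)

# Sorted unique labels present in the data.
classes = np.unique(toy_y)

# "balanced" gives rarer classes a proportionally larger weight.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=toy_y)
class_weights = dict(zip(classes.tolist(), weights.tolist()))
print(class_weights)
```

Passing `class_weight="balanced"` to `SVC` should apply the same heuristic during training, so the explicit computation is mainly useful for inspecting the weights.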

# Split data into training and test sets

In [None]:
from sklearn.model_selection import train_test_split

# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
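
When the classes are imbalanced, a plain random split can leave the test set with a different class ratio than the training set. A possible variation, shown here on made-up toy data rather than the real dataset, is to pass `stratify=` so both parts keep the original proportions. All `toy_*` names are hypothetical.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 80 positive and 20 negative texts.
toy_texts = [f"text {i}" for i in range(100)]
toy_labels = [1] * 80 + [-1] * 20

# stratify=toy_labels keeps the 80/20 class ratio in both splits.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.3, random_state=123, stratify=toy_labels
)

train_counts = Counter(toy_y_train)
test_counts = Counter(toy_y_test)
print(train_counts, test_counts)
```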

In [None]:
len(X_train) == len(y_train)

In [None]:
X_train.shape

# Vectorize text

We will use [TF-IDF](https://monkeylearn.com/blog/what-is-tf-idf/) to turn each text into a numeric feature vector.
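
To make the idea concrete, here is a minimal pure-Python sketch of TF-IDF on three toy documents. It assumes whitespace tokenization and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches the documented defaults of `TfidfVectorizer`. The real vectorizer tokenizes differently, so treat this as an illustration, not a reimplementation. The function `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: rare terms get larger values.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights = {t: f * idf[t] for t, f in term_freq.items()}
        # L2 norm; fall back to 1.0 so an empty document does not divide by zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

toy_docs = ["dobry film", "zły film", "dobry dobry aktor"]
toy_vectors = tfidf(toy_docs)
for doc, vec in zip(toy_docs, toy_vectors):
    print(doc, vec)
```

In the second document, "zły" appears in only one document while "film" appears in two, so "zły" ends up with the larger weight.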

In [None]:
from stop_words import get_stop_words

# https://pypi.org/project/stop-words/
stop_words = get_stop_words('pl')
stop_words[:10]

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# tf-idf vectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
# create vectorizer
vectorizer = TfidfVectorizer(stop_words=stop_words, lowercase=True, max_features=3000)

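# fit_transform = fit + transform: learn the vocabulary and IDF weights from the training data only
X_train = vectorizer.fit_transform(X_train)
# transform only: reuse the fitted vocabulary so no information from the test data leaks into the features
X_test = vectorizer.transform(X_test)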

In [None]:
X_train.shape

# Classifier 

We will use a support vector machine (SVM) classifier.

In [None]:
from sklearn.svm import SVC

# https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
classifier = SVC(probability=True)

# training :) 
classifier.fit(X_train, y_train)

# Model evaluation 
Check how the model performs and predict some labels.

In [None]:
# accuracy on the training data
# score() returns the mean accuracy on the given data and labels
classifier.score(X_train, y_train)

In [None]:
# accuracy on the test data
classifier.score(X_test, y_test)

In [None]:
# check confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, classifier.predict(X_test))
ConfusionMatrixDisplay(cm, display_labels=classifier.classes_).plot()
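
Accuracy and the confusion matrix can be complemented with per-class precision and recall, which is especially informative when one class is rare. The sketch below uses small hand-written label lists (`toy_true`, `toy_pred`, both hypothetical) instead of the real predictions. In the notebook the same call could be made with `y_test` and `classifier.predict(X_test)`.

```python
from sklearn.metrics import classification_report

# Toy labels (an assumption for illustration): 1 = positive, -1 = negative.
toy_true = [1, 1, 1, -1, -1, 1]
toy_pred = [1, 1, -1, -1, 1, 1]

# Human-readable table with precision, recall and F1 per class.
print(classification_report(toy_true, toy_pred))

# The same numbers as a nested dict, keyed by the label converted to a string.
report = classification_report(toy_true, toy_pred, output_dict=True)
print(report["-1"]["recall"], report["accuracy"])
```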

In [None]:
# some predictions: 1 = positive, -1 = negative (hate speech)
x = vectorizer.transform(['zamknij lodówkę'])
classifier.predict(x)

In [None]:
classifier.predict_proba(x)

# Save model
https://scikit-learn.org/stable/model_persistence.html

In [None]:
from joblib import dump, load


# save to file
dump(classifier, 'classifier.joblib') 

In [None]:
# load from file
clf = load('classifier.joblib')

In [None]:
# check that the original and the reloaded model give identical probabilities
(classifier.predict_proba(x) == clf.predict_proba(x)).all()
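
Note that the saved file contains only the classifier, so the fitted vectorizer would have to be saved separately before the model could be used on raw text elsewhere. One possible alternative, sketched below on a tiny made-up corpus, is to wrap both steps in a scikit-learn `Pipeline` and persist that single object. The toy texts, the file name `sentiment_pipeline.joblib` and the variable names are all hypothetical.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny made-up corpus (an assumption for illustration), 1 = positive, -1 = negative.
toy_texts = [
    "super produkt polecam",
    "bardzo dobry zakup",
    "okropny produkt",
    "fatalna jakość nie polecam",
]
toy_labels = [1, 1, -1, -1]

# The pipeline vectorizes raw text and then classifies it in one call.
toy_pipeline = make_pipeline(TfidfVectorizer(lowercase=True), SVC())
toy_pipeline.fit(toy_texts, toy_labels)

# Save and reload the whole pipeline (vectorizer + classifier) as one file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sentiment_pipeline.joblib")
    dump(toy_pipeline, path)
    restored = load(path)

# The restored pipeline accepts raw strings directly.
new_texts = ["dobry produkt", "okropny zakup"]
same = bool((toy_pipeline.predict(new_texts) == restored.predict(new_texts)).all())
print(same)
```

With this layout there is no separate `vectorizer.transform` step to remember at prediction time.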