# Model Definitions
This notebook trains and tests the model, then exports it so it can be served through an API!

In [15]:
# Basic setup
from os import path, getcwd
DIR_PATH = getcwd()

import sys
sys.path.append(path.join(DIR_PATH, "../"))
DATASET_PATH = path.join(DIR_PATH, "../data/news_labelled.csv")

### Step 1: Loading the Dataset
After generating the dataset file with the `dataset_builder.py` utility script, labels (`'positive'`, `'negative'` and `'neutral'`) were **manually** added so the rows can act as training data and a reference for the model.

In [16]:
from pandas import read_csv
df = read_csv(DATASET_PATH)
print(df.head())

                                             content     label
0  China is preparing for one of the most anticip...  negative
1  Do you have a package coming your way from ove...  negative
2  China just hosted the first-ever World Humanoi...  positive
3  Wales knew that the opening game against Scotl...  negative
4  Kate Cross has been left out of England's squa...   neutral


### Step 2: Apply Preprocessing
Clean the text of each article with `clean_text` so that later steps see consistent input. The result is stored in a new `processed` column.

In [17]:
from utils.preprocessing import clean_text
df["processed"] = df["content"].astype(str).apply(clean_text)
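
The body of `clean_text` lives in `utils/preprocessing.py` and is not shown in this notebook. As a rough idea of what a cleaner of this kind often does, here is a minimal sketch. The function name `clean_text_sketch` and every rule in it are assumptions for illustration, not the project's actual implementation.

```python
import re

def clean_text_sketch(text: str) -> str:
    """Hypothetical stand-in for utils.preprocessing.clean_text."""
    text = text.lower()                          # normalise case
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)        # keep letters and whitespace only
    return re.sub(r"\s+", " ", text).strip()     # collapse repeated whitespace

print(clean_text_sketch("Stocks FELL 3% today! See https://example.com"))
# stocks fell today see
```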

### Step 3: Convert Text to Numerical values
A model works on numbers rather than raw text, so each cleaned article is converted into a vector of TF-IDF weights that it can extract patterns from. This will be achieved by using `TfidfVectorizer`.

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

# X holds the TF-IDF features, y holds the matching labels
X = vectorizer.fit_transform(df["processed"])
y = df["label"]
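
To make the TF-IDF idea concrete, the sketch below computes the weights by hand for three invented sentences. It assumes scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation per row) and uses a plain whitespace split instead of the vectorizer's own tokeniser, so treat it as an approximation of what `TfidfVectorizer` does rather than a replacement for it.

```python
import math
from collections import Counter

# Invented toy sentences, not rows from the dataset.
docs = ["markets fell sharply", "markets rose", "team won the match"]
tokenised = [d.split() for d in docs]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df_counts = Counter(term for tokens in tokenised for term in set(tokens))

def tfidf_row(tokens):
    """Return {term: weight} for one document, L2-normalised."""
    tf = Counter(tokens)
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df_counts[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

row = tfidf_row(tokenised[1])
print(row)
```

In the second sentence, "rose" ends up with a higher weight than "markets" because "markets" also appears in another document and is therefore less distinctive.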

***NOTE:** Checking how the labels are distributed shows whether one class dominates, which gives us an idea of what the model will tend to predict.*

In [26]:
print(df["label"].value_counts())

label
neutral     38
negative    34
positive    28
Name: count, dtype: int64
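
These counts also give a simple yardstick: a "model" that always answers with the most common label. The sketch below works that out from the numbers printed above. The variable names are only illustrative.

```python
# Class counts copied from the value_counts() output above.
label_counts = {"neutral": 38, "negative": 34, "positive": 28}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_label] / total

print(majority_label, baseline_accuracy)  # neutral 0.38
```

Any trained model should beat this figure by a clear margin before its predictions are trusted.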


### Step 4: Train/Test Split
Split the data so the model is trained on one part and evaluated on rows it has not seen. `test_size = 0.2` means that 20% of the data is held out for testing and the remaining 80% is used to train.

In [22]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
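
With only 100 rows, a purely random split can leave the test set with a class mix that differs from the full dataset. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses stand-in labels that mirror the counts from the note above, and a list of row indices in place of the real feature matrix.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Stand-in labels mirroring the class counts printed earlier (38 / 34 / 28).
labels = ["neutral"] * 38 + ["negative"] * 34 + ["positive"] * 28
rows = list(range(len(labels)))  # placeholder for the feature rows

rows_train, rows_test, y_train_s, y_test_s = train_test_split(
    rows, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(rows_train), len(rows_test))  # 80 20
print(Counter(y_test_s))                # per-class counts close to the 38/34/28 ratio
```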

### Step 5: Train Model
We will be using a `Logistic Regression` model. Learn more using the following:
- [StatQuest: Logistic Regression](https://www.youtube.com/watch?v=yIYKR4sgzI8)
- [Logistic Regression in 3 Minutes](https://www.youtube.com/watch?v=EKm0spFxFG4)
- [Linear Regression vs Logistic Regression - What's The Difference?](https://www.youtube.com/watch?v=06en5XqdPkI)
- [Logistic Regression (and why it's different from Linear Regression)](https://www.youtube.com/watch?v=3bvM3NyMiE0)

In [21]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter = 1000)
model.fit(X_train, y_train) # Trains model using training data
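
One detail in the steps above: the vectorizer was fitted on every row before the split, so the vocabulary and document frequencies also reflect the test articles. A common way to avoid this is to split the raw text first and wrap the vectorizer and classifier in a single pipeline. The sketch below shows the idea on invented sentences. It is an alternative arrangement under those assumptions, not what this notebook does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy sentences, not rows from news_labelled.csv.
train_texts = [
    "shares plunge after weak earnings",
    "team celebrates historic win",
    "factory fire injures workers",
    "charity raises record donations",
]
train_labels = ["negative", "positive", "negative", "positive"]
test_texts = ["earnings surprise lifts shares"]

# The vectorizer inside the pipeline is fitted on the training texts only.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(train_texts, train_labels)

vocab = pipeline.named_steps["tfidfvectorizer"].vocabulary_
print("lifts" in vocab)  # False: the word only occurs in the test text
print(pipeline.predict(test_texts))
```

A single pipeline object also means there is only one file to export and load later.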

### Step 6: Evaluate the Model
Score the model on the held-out test set, using overall accuracy plus per-class precision, recall and F1.

In [33]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test)
print(f"Accuracy:\n{accuracy_score(y_test, y_pred)}\n")
print(f"Classification Report:\n{classification_report(y_test, y_pred, zero_division = 0)}")

Accuracy:
0.45

Classification Report:
              precision    recall  f1-score   support

    negative       0.33      0.17      0.22         6
     neutral       0.47      1.00      0.64         8
    positive       0.00      0.00      0.00         6

    accuracy                           0.45        20
   macro avg       0.27      0.39      0.29        20
weighted avg       0.29      0.45      0.32        20
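
To see where the 0.45 comes from, the report can be unpacked by hand. Recall is the number of correct predictions divided by support, so multiplying the two recovers the number of correct predictions per class. The sketch below does this with the rounded figures from the report, which is why the results are rounded back to whole numbers.

```python
# (support, recall) per class, read off the classification report above.
report = {"negative": (6, 0.17), "neutral": (8, 1.00), "positive": (6, 0.00)}

# recall = correct / support, so correct is roughly recall * support.
correct = {label: round(recall * support) for label, (support, recall) in report.items()}
total = sum(support for support, _ in report.values())
accuracy = sum(correct.values()) / total

print(correct)   # {'negative': 1, 'neutral': 8, 'positive': 0}
print(accuracy)  # 0.45
```

The precision column tells the rest of the story: 8 correct at 0.47 precision is consistent with 17 of the 20 predictions being `neutral`, and 1 correct at 0.33 with the other 3 being `negative`. In other words, the model labels almost everything `neutral`, and always predicting `neutral` on this test set would already score 8/20 = 0.40.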



### Step 7: Export Model
Saving the **trained** model together with the fitted vectorizer, since both are needed to make predictions on new text.

In [34]:
import joblib
joblib.dump(model, "../models/sentiment_model.pkl")
joblib.dump(vectorizer, "../models/vectorizer.pkl")

['../models/vectorizer.pkl']

### Step 8: Optional Testing
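A quick sanity check on a new sentence. Note that this cell uses the in-memory `model` and `vectorizer` rather than the files saved in Step 7, and that the text is not passed through `clean_text` the way the training data was.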

In [35]:
test_text = "Stock markets has crashed"
X_new = vectorizer.transform([test_text])
print("Prediction:", model.predict(X_new)[0])

Prediction: negative
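
To check the saved files themselves, the objects would need to be loaded back with `joblib.load` and compared with the originals. The sketch below shows that round trip on a tiny invented training set and a temporary directory, so it runs on its own. The file names match Step 7, but everything else is a stand-in.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny invented training set so the sketch is self-contained.
texts = ["markets crash", "team wins final", "prices collapse", "fans celebrate victory"]
labels = ["negative", "positive", "negative", "positive"]

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "sentiment_model.pkl"
    vec_file = Path(tmp) / "vectorizer.pkl"
    joblib.dump(clf, model_file)
    joblib.dump(vec, vec_file)

    # Load both objects back from disk, as the API would.
    loaded_clf = joblib.load(model_file)
    loaded_vec = joblib.load(vec_file)

sample = ["markets crash again"]
same = (loaded_clf.predict(loaded_vec.transform(sample))
        == clf.predict(vec.transform(sample))).all()
print(same)  # True
```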
