# Model Definitions
This notebook trains and tests the model. The trained model is then exported so it can be served through an API!

In [15]:
# Basic setup
from os import path, getcwd
DIR_PATH = getcwd()

import sys
sys.path.append(path.join(DIR_PATH, "../"))
DATASET_PATH = path.join(DIR_PATH, "../data/news_labelled.csv")

### Step 1: Loading the Dataset
After generating the dataset file with the `dataset_builder.py` utility script, labels (`'positive'`, `'negative'` and `'neutral'`) were **manually** added so the file can act as training data and a reference for the model.

In [16]:
from pandas import read_csv
df = read_csv(DATASET_PATH)
print(df.head())

                                             content     label
0  China is preparing for one of the most anticip...  negative
1  Do you have a package coming your way from ove...  negative
2  China just hosted the first-ever World Humanoi...  positive
3  Wales knew that the opening game against Scotl...  negative
4  Kate Cross has been left out of England's squa...   neutral
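
Because the labels were added by hand, a quick sanity check can catch typos or stray whitespace before training. The snippet below is a minimal sketch that uses a small inline frame in place of the real CSV; the sample rows and the `ALLOWED_LABELS` name are assumptions made for illustration.

```python
import pandas as pd

# Assumed label set, taken from the three labels named above
ALLOWED_LABELS = {"positive", "negative", "neutral"}

# Hypothetical rows standing in for news_labelled.csv
sample = pd.DataFrame({
    "content": ["Good news today", "Markets fall sharply", "Squad announced", "Odd row"],
    "label": ["positive", "negative", "neutral", "Negative "],
})

# Normalise case and surrounding whitespace so "Negative " matches "negative"
sample["label"] = sample["label"].str.strip().str.lower()

# Anything left outside the allowed set is likely a labelling typo
unexpected = set(sample["label"].dropna().unique()) - ALLOWED_LABELS
label_counts = sample["label"].value_counts()

print(label_counts)
print("Unexpected labels:", unexpected)
```

Running the same check on `df["label"]` would also show how balanced the three classes are.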


### Step 2: Apply Preprocessing
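Clean the text before vectorizing so that inconsistent formatting does not introduce noise or errors. Each value is cast to `str` first so `clean_text` always receives text.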

In [17]:
from utils.preprocessing import clean_text
df["processed"] = df["content"].astype(str).apply(clean_text)
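
The implementation of `clean_text` lives in `utils/preprocessing.py` and is not shown here. As a rough idea of what such a helper might do, here is a hypothetical stand-in named `clean_text_sketch`; the specific steps (lowercasing, dropping URLs and punctuation, collapsing whitespace) are assumptions, not the project's actual logic.

```python
import re

def clean_text_sketch(text: str) -> str:
    """Hypothetical stand-in for utils.preprocessing.clean_text."""
    text = text.lower()                        # case-fold
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

cleaned = clean_text_sketch("Hello, World! Visit https://example.com NOW")
print(cleaned)
```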

### Step 3: Convert Text to Numerical values
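Models work on numbers rather than raw text, so each processed article is converted into a numeric feature vector from which patterns can be learned. This is done with `TfidfVectorizer`, which weights each term by how often it appears in a document, discounted by how common it is across the whole corpus.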

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

# X is the TF-IDF feature matrix (one row per article); y holds the matching labels
X = vectorizer.fit_transform(df["processed"])
y = df["label"]

### Step 4: Train/Test Split
Hold out part of the data so the model can be evaluated on examples it did not see during training.