# 1. Introduction
**FastText** is an open-source library for text classification and word-representation learning, created by **Facebook's AI Research (FAIR)** team. It is designed to train quickly, even on large text corpora.

A *FastText News Classifier* is a machine learning model that uses **FastText** to assign news articles to categories such as sports, politics, or entertainment.
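
For context, the `fasttext` package's supervised mode reads a plain-text training file in which each line starts with one or more labels carrying the `__label__` prefix, followed by the text. Below is a minimal sketch of producing such lines. The helper name `to_fasttext_line` and the sample categories are illustrative assumptions, not part of this notebook.

```python
def to_fasttext_line(label: str, text: str) -> str:
    """Format one example the way fastText's supervised mode expects by default."""
    # Labels are whitespace-delimited, so replace inner spaces with underscores.
    safe_label = "_".join(label.split())
    # Newlines would split one example across several lines of the file.
    flat_text = " ".join(text.split())
    return f"__label__{safe_label} {flat_text}"


# Made-up examples for illustration only.
examples = [
    ("sports", "Local team wins the final"),
    ("politics", "Parliament passes the new budget"),
]
lines = [to_fasttext_line(label, text) for label, text in examples]
print("\n".join(lines))
```

Writing such lines to a file and passing its path to `fasttext.train_supervised(input=...)` would be the usual next step.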

**Thanks to:**
* [E-commerce Text Classification Using FastText](https://www.kaggle.com/code/sunilthite/e-commerce-text-classification-using-fasttext/input)
* [Loading Kaggle data directly into Google Colab](https://www.youtube.com/watch?v=yEXkEUqK52Q)
* [Importing Datasets from Kaggle to Google Colab](https://saturncloud.io/blog/importing-datasets-from-kaggle-to-google-colab/)
* [kaggle dataset download 403 Forbidden](https://stackoverflow.com/questions/75569191/kaggle-dataset-download-403-forbidden)
* [Traffic Accident in Indonesia](https://www.kaggle.com/datasets/dodyagung/accident)
* [Jigsaw Regression Based Data](https://www.kaggle.com/datasets/nkitgupta/jigsaw-regression-based-data)

# 2. Import libraries

In [1]:
!pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.3.tar.gz (73 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73.4/73.4 kB 2.7 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.13.6-py3-none-any.whl.metadata (9.5 kB)
Using cached pybind11-2.13.6-py3-none-any.whl (243 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (pyproject.toml) ... done
  Created wheel for fasttext: filename=fasttext-0.9.3-cp310-cp310-linux_x86_64.whl size=4296187 sha256=8fc6f54fb34ee7dea6f2068d94761cb2c7b216639aa4a729c141da5eb4315565
  Stored in directory: /root/.cache/pip/wheels/0d/a2/00/81db54d3e6a8199b829d58

In [27]:
# FastText library for text classification and training the FastText model.
import fasttext

# pandas library for data manipulation.
import pandas as pd

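# re for regular-expression-based text cleaning.
import re

# gensim's FastText implementation for training word embeddings.
from gensim.models import FastText

# TF-IDF vectorizer for bag-of-words features.
from sklearn.feature_extraction.text import TfidfVectorizer

# NumPy for array operations.
import numpy as np

# LabelEncoder for converting class labels to integers.
from sklearn.preprocessing import LabelEncoder

# Utilities for train/test splitting and K-fold cross-validation.
from sklearn.model_selection import train_test_split, KFold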

# 3. Load the dataset
Load the dataset from a CSV file into a pandas DataFrame for easy manipulation.

In [3]:
# Kaggle library to download datasets from Kaggle
!pip install kaggle



In [5]:
# Mount Google Drive: Import the Drive to access and store the Kaggle API key in Google Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
# Set Kaggle Configuration: To direct Kaggle to the appropriate directory in Drive
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/kaggle'

In [8]:
# Copy the Kaggle API key from Google Drive to ~/.kaggle
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/kaggle/kaggle.json ~/.kaggle/kaggle.json

In [9]:
# Restrict the key file's permissions to owner read/write only
!chmod 600 ~/.kaggle/kaggle.json
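
The same credential setup can also be done from Python rather than with shell commands. The sketch below is one possible equivalent under stated assumptions: the function name `install_kaggle_token` is hypothetical, and both paths are parameters so the function can be tried outside Colab.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(source: Path, config_dir: Path) -> Path:
    """Copy kaggle.json into config_dir and make it accessible to the owner only."""
    # Like `mkdir -p`: create the directory, do nothing if it already exists.
    config_dir.mkdir(parents=True, exist_ok=True)
    target = config_dir / "kaggle.json"
    shutil.copyfile(source, target)
    # Equivalent of `chmod 600`: read/write for the owner, nothing for others.
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target


# In Colab, with the paths used in the cells above, the call would be:
# install_kaggle_token(Path("/content/drive/MyDrive/kaggle/kaggle.json"),
#                      Path.home() / ".kaggle")
```

Unlike a bare `mkdir`, this version does not fail when the cell is re-run and the directory already exists.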

In [10]:
# Download the dataset from Kaggle using the Kaggle API.

# Alternative dataset, left commented out:
# !kaggle datasets download -d nkitgupta/jigsaw-regression-based-data
# !unzip /content/jigsaw-regression-based-data.zip
# !rm /content/jigsaw-regression-based-data.zip

!kaggle datasets download -d dodyagung/accident
!unzip /content/accident.zip
!rm /content/accident.zip

Dataset URL: https://www.kaggle.com/datasets/dodyagung/accident
License(s): CC-BY-NC-SA-4.0
Downloading accident.zip to /content
100% 313M/313M [00:03<00:00, 106MB/s]
Archive:  /content/accident.zip
  inflating: twitter.csv             
  inflating: twitter_label_auto.csv  
  inflating: twitter_label_manual.csv  


## 3.1 Explore the dataset

In [11]:
df = pd.read_csv('/content/twitter.csv')
df.head()

| | id_str | created_at | crawled_at | screen_name | full_text | full_tweet |
|---|---|---|---|---|---|---|
| 0 | 1113812743138697216 | 2019-04-04 21:37:06 | 2020-02-08 12:30:13 | vtvindonesia | Pelajar SMP Tewas Kecelakaan di Jalinsum Banda... | {"created_at": "Thu Apr 04 14:37:06 +0000 2019... |
| 1 | 1113813804708548608 | 2019-04-04 21:41:19 | 2020-02-08 12:30:13 | briand_fergie | Orang-orang pulang nonton film ini langsung ga... | {"created_at": "Thu Apr 04 14:41:19 +0000 2019... |
| 2 | 1113823718956908545 | 2019-04-04 22:20:43 | 2020-02-08 12:30:13 | dimanamacetid | [22:14] #JAKARTA #KECELAKAAN Rawamangun #TMC | {"created_at": "Thu Apr 04 15:20:43 +0000 2019... |
| 3 | 1113846220986900480 | 2019-04-04 23:50:08 | 2020-02-08 12:30:14 | imaamhanavi | [83:1] Kecelakaan besarlah bagi orang-orang ya... | {"created_at": "Thu Apr 04 16:50:08 +0000 2019... |
| 4 | 1113847850801090560 | 2019-04-04 23:56:36 | 2020-02-08 12:30:14 | mdntkptr | Anggapannya kayak mobil vs motor kecelakaan, y... | {"created_at": "Thu Apr 04 16:56:36 +0000 2019... |


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157629 entries, 0 to 157628
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   id_str       157629 non-null  int64 
 1   created_at   157629 non-null  object
 2   crawled_at   157629 non-null  object
 3   screen_name  157629 non-null  object
 4   full_text    157629 non-null  object
 5   full_tweet   157629 non-null  object
dtypes: int64(1), object(5)
memory usage: 7.2+ MB


# 4. Preprocess the dataset

In [13]:
X = df['full_text']
X.head()

| | full_text |
|---|---|
| 0 | Pelajar SMP Tewas Kecelakaan di Jalinsum Banda... |
| 1 | Orang-orang pulang nonton film ini langsung ga... |
| 2 | [22:14] #JAKARTA #KECELAKAAN Rawamangun #TMC |
| 3 | [83:1] Kecelakaan besarlah bagi orang-orang ya... |
| 4 | Anggapannya kayak mobil vs motor kecelakaan, y... |


## 4.1 Remove special characters and digits, lowercase, and trim whitespace
[re — Regular expression operations](https://docs.python.org/3/library/re.html)

In [14]:
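# Preprocessing function to clean the text
def preprocess_text(text):
    # Remove everything except ASCII letters and whitespace
    # (digits, punctuation, and non-ASCII characters are dropped).
    # Flags must be passed by keyword: the fourth positional
    # argument of re.sub is `count`, not `flags`.
    # re.A (ASCII-only matching)
    # re.I (ignore case)
    text = re.sub(r'[^a-zA-Z\s]', '', text, flags=re.I | re.A)

    # Convert to lowercase
    text = text.lower()

    # Strip leading and trailing whitespace
    text = text.strip()
    return text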
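
One detail of `re.sub` matters for a cleaning function like this. Its signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a flag value placed in the fourth positional slot is read as `count`, a limit on the number of substitutions. The sketch below shows the difference on a made-up string. `re.I | re.A` has the integer value 258, which is why exactly 258 characters are removed in the first call.

```python
import re

pattern = r"[^a-zA-Z\s]"
text = "!" * 300 + "abc"  # made-up input with 300 characters to strip

# What a positional `re.I | re.A` amounts to: a substitution limit of 258.
limit = int(re.I | re.A)
limited = re.sub(pattern, "", text, count=limit)

# Passing the flags by keyword applies them as flags and removes every match.
cleaned = re.sub(pattern, "", text, flags=re.I | re.A)

print(limit)         # 258
print(len(limited))  # 45: 42 leftover "!" characters plus "abc"
print(cleaned)       # abc
```

Short tweets rarely contain that many special characters, so the limit goes unnoticed on most rows. It only bites on unusually long or noisy inputs.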

In [15]:
# Apply preprocessing to the 'full_text' column
df['cleaned_text'] = df['full_text'].apply(preprocess_text)

In [16]:
# Check cleaned text
df['cleaned_text'].head()

| | cleaned_text |
|---|---|
| 0 | pelajar smp tewas kecelakaan di jalinsum banda... |
| 1 | orangorang pulang nonton film ini langsung gal... |
| 2 | jakarta kecelakaan rawamangun tmc |
| 3 | kecelakaan besarlah bagi orangorang yang curang |
| 4 | anggapannya kayak mobil vs motor kecelakaan yg... |
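
The output above shows `orang-orang` collapsing into `orangorang`, because removed characters are replaced with an empty string. If keeping the two halves as separate tokens is preferable, one option is to substitute a space and then collapse repeated whitespace. This is an alternative sketch, not what the notebook does, and the function name `preprocess_keep_boundaries` is hypothetical.

```python
import re


def preprocess_keep_boundaries(text: str) -> str:
    """Variant of the cleaning step that keeps word boundaries at punctuation."""
    # Replace every character that is not an ASCII letter or whitespace with a space.
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    # Collapse runs of whitespace into one space and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


print(preprocess_keep_boundaries("Orang-orang pulang nonton film"))
# orang orang pulang nonton film
print(preprocess_keep_boundaries("[22:14] #JAKARTA #KECELAKAAN Rawamangun #TMC"))
# jakarta kecelakaan rawamangun tmc
```

Whether the merged or the split form is better depends on the language and task. For Indonesian reduplicated words such as `orang-orang`, both choices lose some information.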


# 5. Perform word embedding using **"FastText"**


In [17]:
# Tokenize the cleaned text for FastText model training
df['tokenized_text'] = df['cleaned_text'].apply(lambda x: x.split())
df['tokenized_text'].head()

| | tokenized_text |
|---|---|
| 0 | [pelajar, smp, tewas, kecelakaan, di, jalinsum... |
| 1 | [orangorang, pulang, nonton, film, ini, langsu... |
| 2 | [jakarta, kecelakaan, rawamangun, tmc] |
| 3 | [kecelakaan, besarlah, bagi, orangorang, yang,... |
| 4 | [anggapannya, kayak, mobil, vs, motor, kecelak... |


[models.fasttext – FastText model](https://radimrehurek.com/gensim/models/fasttext.html)

In [18]:
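# Train a skip-gram (sg=1) FastText model using the tokenized text.
# vector_size=1, window=1 and epochs=1 keep this demo fast but yield
# one-dimensional embeddings; real use would typically call for larger
# values (gensim's defaults are vector_size=100, window=5, epochs=5).
fasttext_model = FastText(sentences=df['tokenized_text'], vector_size=1, window=1, min_count=1, sg=1, epochs=1)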
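
FastText differs from plain word2vec in that each word vector is built from the word's character n-grams, with the word wrapped in `<` and `>` boundary markers. Below is a minimal sketch of that n-gram extraction, assuming a range of 3 to 6 characters (gensim's `min_n` and `max_n` defaults). `char_ngrams` is an illustrative helper, not a gensim function, and it leaves out the hashing of n-grams into buckets that real implementations perform.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return the character n-grams of `word`, including boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because vectors are composed from these pieces, a word that never appeared in training can still receive a vector as long as it shares n-grams with known words.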

In [19]:
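# Create a function to get average word vectors for each document
def get_avg_word_vector(text, model):

    # Keep words the model can produce a vector for. Note: gensim's FastText
    # can usually build vectors for unseen words from character n-grams, so
    # this check rarely filters anything out.
    words = [word for word in text if word in model.wv]
    if len(words) > 0:
        return np.mean(model.wv[words], axis=0)
    else:
        # Empty documents get an all-zero vector
        return np.zeros(model.vector_size)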
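
Averaging token vectors (mean pooling) is easiest to see on a toy example. The sketch below uses plain Python lists and a made-up two-dimensional lookup table, `toy_vectors`, in place of a trained model, so the numbers are illustrative only.

```python
def mean_pool(tokens, vectors, dim):
    """Average the vectors of the known tokens; return zeros if none are known."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return [0.0] * dim
    # zip(*found) groups the values of each dimension together.
    return [sum(values) / len(found) for values in zip(*found)]


# Hypothetical embeddings, not taken from the trained model.
toy_vectors = {"mobil": [1.0, 0.0], "motor": [0.0, 1.0]}

print(mean_pool(["mobil", "vs", "motor"], toy_vectors, dim=2))  # [0.5, 0.5]
print(mean_pool([], toy_vectors, dim=2))                        # [0.0, 0.0]
```

Mean pooling ignores word order, so two documents containing the same tokens in a different sequence map to the same vector.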

In [20]:
# Apply the function to get average FastText embeddings for each document
df['embedding'] = df['tokenized_text'].apply(lambda x: get_avg_word_vector(x, fasttext_model))

In [21]:
# Check embedding results
df['embedding'].head()

| | embedding |
|---|---|
| 0 | [-5.5425873] |
| 1 | [-5.5031557] |
| 2 | [-5.5433254] |
| 3 | [-6.2946925] |
| 4 | [-5.593171] |


In [22]:
# Prepare data for classification: stack the per-document embeddings into a 2-D array
X = np.array(df['embedding'].tolist())
# Random binary labels for demo purposes only; replace with actual labels if available
y = np.random.randint(2, size=X.shape[0])

In [25]:
# Encode labels as integers (a no-op for 0/1 labels, but needed for string labels)
le = LabelEncoder()
y = le.fit_transform(y)
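
For string labels, `LabelEncoder` maps each distinct class to an integer, assigning codes in sorted order of the class names. Below is a dependency-free sketch of that mapping. The label names are hypothetical, since the dataset's actual label columns are not shown in this section.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, in sorted order of the labels."""
    classes = sorted(set(labels))
    index = {name: code for code, name in enumerate(classes)}
    return [index[name] for name in labels], classes


# Hypothetical label names for illustration only.
codes, classes = encode_labels(["accident", "other", "accident", "traffic_jam"])
print(classes)  # ['accident', 'other', 'traffic_jam']
print(codes)    # [0, 1, 0, 2]
```

Keeping the `classes` list around makes it possible to translate predicted integers back into label names, which is the role `le.inverse_transform` plays in scikit-learn.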

In [28]:
# Set up 5-fold cross-validation (the actual splits are produced by kf.split)
kf = KFold(n_splits=5)
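
Without `shuffle=True`, `KFold` cuts the samples into consecutive blocks. When the sample count is not divisible by the number of folds, the first folds each receive one extra sample. The sketch below reproduces that size arithmetic in plain Python for the 157,629 rows reported by `df.info()`. `fold_sizes` is an illustrative helper, not a scikit-learn function.

```python
def fold_sizes(n_samples: int, n_splits: int) -> list[int]:
    """Sizes of the test folds when samples are split into consecutive blocks."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds absorb the remainder, one sample each.
    return [base + 1 if i < extra else base for i in range(n_splits)]


sizes = fold_sizes(157629, 5)
print(sizes)       # [31526, 31526, 31526, 31526, 31525]
print(sum(sizes))  # 157629
```

Because the blocks are consecutive, any ordering in the data (for example by crawl time) carries over into the folds. Passing `shuffle=True` together with a `random_state` to `KFold` is a common way to avoid that.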