# 1. Introduction
**FastText** is an open-source library for text classification and word-representation learning, created by **Facebook's AI Research (FAIR)** team. It is designed to train and predict quickly, even on large text datasets.

A *FastText News Classifier* is a machine learning model that uses **FastText** to assign news articles to categories such as sports, politics, or entertainment.
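
For reference, fastText's supervised mode reads plain-text training files in which each line holds one example: one or more labels marked with the `__label__` prefix, followed by the text. The sketch below illustrates that line format only. The helper name `to_fasttext_line` and the example headlines are hypothetical and are not part of the fasttext API or of this notebook's dataset.

```python
def to_fasttext_line(label: str, text: str) -> str:
    """Format one example as a fastText supervised-training line.

    Hypothetical helper for illustration; not part of the fasttext API.
    """
    # fastText expects one example per line, so collapse all whitespace
    # (including embedded newlines) into single spaces.
    single_line = " ".join(text.split())
    # A label must be a single token, so replace whitespace with underscores.
    safe_label = "_".join(label.split())
    return f"__label__{safe_label} {single_line}"


# Hypothetical examples, not taken from the dataset.
examples = [
    ("sports", "Local team wins\nthe championship"),
    ("politics", "Parliament passes new budget"),
]
lines = [to_fasttext_line(label, text) for label, text in examples]
print("\n".join(lines))
```

A file made of such lines is what `fasttext.train_supervised` takes as its `input` argument.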

**Thanks to:**
* [E-commerce Text Classification Using FastText](https://www.kaggle.com/code/sunilthite/e-commerce-text-classification-using-fasttext/input)
* [Loading Kaggle data directly into Google Colab](https://www.youtube.com/watch?v=yEXkEUqK52Q)
* [Importing Datasets from Kaggle to Google Colab](https://saturncloud.io/blog/importing-datasets-from-kaggle-to-google-colab/)
* [kaggle dataset download 403 Forbidden](https://stackoverflow.com/questions/75569191/kaggle-dataset-download-403-forbidden)
* [Traffic Accident in Indonesia](https://www.kaggle.com/datasets/dodyagung/accident)
* [Jigsaw Regression Based Data](https://www.kaggle.com/datasets/nkitgupta/jigsaw-regression-based-data)

# 2. Import libraries

In [None]:
!pip install fasttext



In [None]:
# FastText library for text classification and training the FastText model.
import fasttext

# pandas library for data manipulation.
import pandas as pd

# 3. Load the dataset
Load the dataset from a CSV file into a pandas DataFrame for easy manipulation.

In [None]:
# Kaggle library to download datasets from Kaggle
!pip install kaggle



In [None]:
# Mount Google Drive so the Kaggle API key stored there is accessible from Google Colab
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Point the Kaggle client at the Drive directory that holds kaggle.json
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/kaggle'

In [None]:
# Copy the Kaggle API key from Google Drive to ~/.kaggle
# (-p avoids an error when the directory already exists)
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/kaggle/kaggle.json ~/.kaggle/kaggle.json


In [None]:
# Restrict the key file so that only the owner can read and write it
!chmod 600 ~/.kaggle/kaggle.json
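
The credential setup above can be sanity-checked before calling the Kaggle CLI. The following is a minimal sketch, assuming the key file is JSON with `username` and `key` fields and should carry owner-only permissions. The helper `check_kaggle_key` is hypothetical and not part of the Kaggle client.

```python
import json
import stat
from pathlib import Path
from typing import List


def check_kaggle_key(path: Path) -> List[str]:
    """Return a list of problems found with a kaggle.json file (empty if none).

    Hypothetical helper for illustration; not part of the Kaggle client.
    """
    if not path.is_file():
        return [f"{path} does not exist"]

    problems = []

    # Mode 600 means owner read/write only; flag any group/other permission bits.
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:
        problems.append(f"permissions are {oct(mode)}, expected 0o600")

    # The key file should be a JSON object with "username" and "key" fields.
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError:
        return problems + ["file is not valid JSON"]
    for field in ("username", "key"):
        if not isinstance(data, dict) or field not in data:
            problems.append(f"missing field: {field}")

    return problems


# An empty list means no problems were detected.
print(check_kaggle_key(Path.home() / ".kaggle" / "kaggle.json"))
```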

In [None]:
# Download the dataset from Kaggle using the Kaggle API.
# (The commented-out commands fetch the alternative Jigsaw dataset.)

# !kaggle datasets download -d nkitgupta/jigsaw-regression-based-data
# !unzip /content/jigsaw-regression-based-data.zip
# !rm /content/jigsaw-regression-based-data.zip

!kaggle datasets download -d dodyagung/accident
!unzip /content/accident.zip
!rm /content/accident.zip

Dataset URL: https://www.kaggle.com/datasets/dodyagung/accident
License(s): CC-BY-NC-SA-4.0
accident.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  /content/accident.zip
replace twitter.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


## 3.1 Explore the dataset

In [None]:
df = pd.read_csv('/content/twitter.csv')
df.head(10)

| | id_str | created_at | crawled_at | screen_name | full_text | full_tweet |
|---|---|---|---|---|---|---|
| 0 | 1113812743138697216 | 2019-04-04 21:37:06 | 2020-02-08 12:30:13 | vtvindonesia | Pelajar SMP Tewas Kecelakaan di Jalinsum Banda... | {"created_at": "Thu Apr 04 14:37:06 +0000 2019... |
| 1 | 1113813804708548608 | 2019-04-04 21:41:19 | 2020-02-08 12:30:13 | briand_fergie | Orang-orang pulang nonton film ini langsung ga... | {"created_at": "Thu Apr 04 14:41:19 +0000 2019... |
| 2 | 1113823718956908545 | 2019-04-04 22:20:43 | 2020-02-08 12:30:13 | dimanamacetid | [22:14] #JAKARTA #KECELAKAAN Rawamangun #TMC | {"created_at": "Thu Apr 04 15:20:43 +0000 2019... |
| 3 | 1113846220986900480 | 2019-04-04 23:50:08 | 2020-02-08 12:30:14 | imaamhanavi | [83:1] Kecelakaan besarlah bagi orang-orang ya... | {"created_at": "Thu Apr 04 16:50:08 +0000 2019... |
| 4 | 1113847850801090560 | 2019-04-04 23:56:36 | 2020-02-08 12:30:14 | mdntkptr | Anggapannya kayak mobil vs motor kecelakaan, y... | {"created_at": "Thu Apr 04 16:56:36 +0000 2019... |
| 5 | 1113848173166977024 | 2019-04-04 23:57:53 | 2020-02-08 12:30:15 | serambinews | #TrukTangki\n\n#Kecelakaan\n#Evakuasi\n#NaganR... | {"created_at": "Thu Apr 04 16:57:53 +0000 2019... |
| 6 | 1113849680297775105 | 2019-04-05 00:03:52 | 2020-02-08 12:30:15 | DeveessS | Aq benci laki2 :'( \nAku benci hidup ini\nKnp ... | {"created_at": "Thu Apr 04 17:03:52 +0000 2019... |
| 7 | 1113849914788679680 | 2019-04-05 00:04:48 | 2020-02-08 12:30:15 | TMCPoldaMetro | Gunakan jalur sesuai ketentuan, jangan melawan... | {"created_at": "Thu Apr 04 17:04:48 +0000 2019... |
| 8 | 1113851311621959681 | 2019-04-05 00:10:21 | 2020-02-08 12:30:15 | dimanamacetid | [00:04] #JAKARTA #KECELAKAAN Jl. Raya Kembanga... | {"created_at": "Thu Apr 04 17:10:21 +0000 2019... |
| 9 | 1113858337085374469 | 2019-04-05 00:38:16 | 2020-02-08 12:30:16 | TimmyChucker | korban penipuan gak sakit, korban kecelakaan j... | {"created_at": "Thu Apr 04 17:38:16 +0000 2019... |
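
In the preview, the `created_at` column appears to run seven hours ahead of the UTC timestamp embedded at the start of `full_tweet`. The sketch below checks this for the first row only, under the assumption (not stated in the dataset description shown here) that the column is in Western Indonesia Time (UTC+7).

```python
from datetime import datetime, timedelta, timezone

# Values copied from the first row of the preview.
utc_stamp = "Thu Apr 04 14:37:06 +0000 2019"  # start of the `full_tweet` value
local_stamp = "2019-04-04 21:37:06"           # `created_at` column

# Assumption: `created_at` is expressed in Western Indonesia Time (UTC+7).
WIB = timezone(timedelta(hours=7))

# Parse the Twitter-style timestamp, then shift it into the assumed zone.
parsed = datetime.strptime(utc_stamp, "%a %b %d %H:%M:%S %z %Y")
converted = parsed.astimezone(WIB).strftime("%Y-%m-%d %H:%M:%S")

print(converted, converted == local_stamp)
```

If the two values agree across more rows, the column can be treated as local time when filtering or grouping by date.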


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157629 entries, 0 to 157628
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   id_str       157629 non-null  int64 
 1   created_at   157629 non-null  object
 2   crawled_at   157629 non-null  object
 3   screen_name  157629 non-null  object
 4   full_text    157629 non-null  object
 5   full_tweet   157629 non-null  object
dtypes: int64(1), object(5)
memory usage: 7.2+ MB


# 4. Preprocess the dataset

In [None]:
# Select the tweet text column, which serves as the model input
X = df['full_text']
X.head(10)

| | full_text |
|---|---|
| 0 | Pelajar SMP Tewas Kecelakaan di Jalinsum Banda... |
| 1 | Orang-orang pulang nonton film ini langsung ga... |
| 2 | [22:14] #JAKARTA #KECELAKAAN Rawamangun #TMC |
| 3 | [83:1] Kecelakaan besarlah bagi orang-orang ya... |
| 4 | Anggapannya kayak mobil vs motor kecelakaan, y... |
| 5 | #TrukTangki\n\n#Kecelakaan\n#Evakuasi\n#NaganR... |
| 6 | Aq benci laki2 :'( \nAku benci hidup ini\nKnp ... |
| 7 | Gunakan jalur sesuai ketentuan, jangan melawan... |
| 8 | [00:04] #JAKARTA #KECELAKAAN Jl. Raya Kembanga... |
| 9 | korban penipuan gak sakit, korban kecelakaan j... |
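
Several tweets in the preview contain embedded newlines, and fastText reads one example per line, so the text needs to be flattened before it is written to a training file. The following is a minimal sketch of one possible cleaning step. The function name `clean_tweet`, the choice of steps (dropping links, lowercasing, collapsing whitespace), and the sample string are illustrative assumptions, not the notebook's own method.

```python
import re

# Matches http/https links up to the next whitespace character.
URL_RE = re.compile(r"https?://\S+")


def clean_tweet(text: str) -> str:
    """Normalize a tweet into a single lowercase line (illustrative only)."""
    text = URL_RE.sub(" ", text)  # drop links
    text = text.lower()
    # Collapse all whitespace, including newlines, into single spaces.
    return " ".join(text.split())


# Hypothetical sample modeled on row 5 of the preview; the URL is made up.
sample = "#TrukTangki\n\n#Kecelakaan\n#Evakuasi https://t.co/xyz"
print(clean_tweet(sample))
```

With pandas, the same function could be applied to the whole column via `X.apply(clean_tweet)`.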
