
# Task

* Load the data
* Split the data into training and test sets
* In `MyTokenizer`, add text processing with `nltk` and apply **stemming**
* Apply `TfidfVectorizer`
* Apply `TruncatedSVD`
* Train an [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) model with `kernel="linear"`; if memory allows, you can also try `kernel="rbf"`
* Print the results

# Telegram data

In [5]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Load the Telegram spam dataset (columns used below: `text`, `text_type`).
df = pd.read_csv("https://raw.githubusercontent.com/HalyshAnton/IT-Step-Pyton-AI/main/module6/data/telegram_spam.csv")

In [6]:
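X = df['text']
y = df['text_type']

# 80/20 split; `stratify=y` keeps the ham/spam ratio the same in both parts.
# No `random_state` is set, so the split (and the scores) vary between runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, stratify=y)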

In [7]:
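# Tokenizer and stop-word resources required by word_tokenize and stopwords.
nltk.download('punkt_tab')
nltk.download('stopwords')


class MyTokenizer:
    """Lowercase, tokenize, drop English stop words, and stem with PorterStemmer."""

    def __init__(self):
        self.stemmer = PorterStemmer()
        self.stop_words = set(stopwords.words('english'))

    def __call__(self, text):
        tokens = word_tokenize(text.lower())
        # Punctuation tokens are kept; only stop words are filtered out.
        filtered_tokens = [token for token in tokens if token not in self.stop_words]
        return [self.stemmer.stem(token) for token in filtered_tokens]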

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
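
To make the effect of stop-word removal and stemming visible, here is a minimal sketch on a single sentence. It is not the `MyTokenizer` class itself: so that it runs without downloading NLTK data, it swaps in a regex tokenizer for `word_tokenize` and a tiny hand-picked stop-word set (`TOY_STOP_WORDS`, a hypothetical name) for the NLTK list. Only `PorterStemmer` is taken from the notebook.

```python
import re

from nltk.stem import PorterStemmer

# Assumption: a tiny hand-picked stop-word set instead of stopwords.words('english').
TOY_STOP_WORDS = {"you", "are", "a", "the", "to"}


def toy_tokenize(text):
    """Lowercase, split on alphanumeric runs, drop stop words, then stem."""
    stemmer = PorterStemmer()
    # Assumption: a simple regex stands in for nltk.word_tokenize.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stemmer.stem(token) for token in tokens if token not in TOY_STOP_WORDS]


tokens = toy_tokenize("You are winning free prizes")
print(tokens)
```

Stemming maps inflected forms such as "winning" to a shared stem, so the vectorizer later counts them as one feature.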


In [8]:
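tokenizer = MyTokenizer()
# `token_pattern=None` avoids the warning that the default pattern is ignored
# when a custom tokenizer is supplied; `max_features` keeps the 2,000 most frequent terms.
tfidf = TfidfVectorizer(tokenizer=tokenizer, token_pattern=None, max_features=2_000)
# Fit the vocabulary on the training texts only, then reuse it for the test texts.
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)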
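
As a rough illustration of how a custom tokenizer and `max_features` interact in `TfidfVectorizer`, the sketch below uses a made-up three-document corpus and `str.split` as a stand-in for `MyTokenizer`. The corpus and the value `max_features=4` are assumptions chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus, for illustration only.
docs = ["free prize now", "meeting at noon", "free free prize"]

# str.split stands in for MyTokenizer; any callable that maps a string
# to a list of tokens can be passed as `tokenizer`.
vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None, max_features=4)
matrix = vectorizer.fit_transform(docs)

# The corpus has six distinct terms, but only four columns are kept.
print(matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

The result is a sparse matrix with one row per document and at most `max_features` columns, which is the kind of input `TruncatedSVD` accepts directly.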



In [9]:
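from sklearn.decomposition import TruncatedSVD

# Despite the variable name `pca`, this is a TruncatedSVD, which works directly
# on the sparse TF-IDF matrix without densifying it.
pca = TruncatedSVD(n_components=1746)
X_train_transformed = pca.fit_transform(X_train)
X_test_transformed = pca.transform(X_test)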

In [10]:
pca.explained_variance_ratio_.sum()

0.984358472756055

In [11]:
pca.explained_variance_ratio_.cumsum()

array([0.02675419, 0.04068718, 0.04989618, ..., 0.98419281, 0.98427575,
       0.98435847])

In [12]:
import numpy as np

np.where(pca.explained_variance_ratio_.cumsum() >= 0.70)

(array([ 579,  580,  581, ..., 1743, 1744, 1745]),)
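
The `np.where` call returns every zero-based index at which the cumulative explained variance is at least 0.70. The first index in the output above is 579, so 580 components are needed to reach 70%. A minimal sketch of extracting that single number, using made-up ratios rather than the notebook's values:

```python
import numpy as np

# Made-up explained-variance ratios for five components (illustration only).
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.05])
cumulative = ratios.cumsum()

# argmax on a boolean array returns the position of the first True,
# i.e. the first zero-based index that reaches the threshold.
first_index = int(np.argmax(cumulative >= 0.70))

# Indices are zero-based, so the component count is one larger.
n_components = first_index + 1
print(first_index, n_components)
```

Note that `np.argmax` returns 0 when no element satisfies the condition, so this shortcut is only safe when the threshold is actually reached.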

In [13]:
from sklearn.svm import SVC

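# Note: the model is fitted on the TF-IDF matrix `X_train`; the SVD output
# `X_train_transformed` computed above is not used here.
model = SVC(kernel="linear")

model.fit(X_train, y_train)
model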

In [14]:
from sklearn.metrics import classification_report

# Predict on the TF-IDF test matrix, matching the features the model was fitted on.
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         ham       0.95      0.97      0.96      2868
        spam       0.93      0.87      0.90      1202

    accuracy                           0.94      4070
   macro avg       0.94      0.92      0.93      4070
weighted avg       0.94      0.94      0.94      4070
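
The cells above fit `SVC` on the TF-IDF matrix, so the scores in this report do not reflect the `TruncatedSVD` features. One possible way to chain all three steps, so the classifier always receives the reduced features, is scikit-learn's `make_pipeline`. The sketch below uses a tiny made-up corpus, the default `TfidfVectorizer` tokenizer, and `n_components=2`; these are assumptions for illustration, not the notebook's settings, and no scores are claimed for the real dataset.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny made-up corpus, for illustration only.
texts = [
    "win a free prize now", "free cash prize waiting", "claim your free prize",
    "lunch at noon today", "see you at the meeting", "notes from the meeting today",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# TF-IDF -> TruncatedSVD -> linear SVC, fitted as a single estimator.
pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    SVC(kernel="linear"),
)
pipeline.fit(texts, labels)

# All steps except the classifier: shows the reduced feature shape.
reduced = pipeline[:-1].transform(texts)
print(reduced.shape)

predictions = pipeline.predict(["free prize", "meeting at noon"])
print(predictions)
```

With a pipeline, `fit` and `predict` apply the same transformations in the same order, which removes the risk of passing the wrong matrix to the model.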

