<a href="https://colab.research.google.com/github/daryaami/NLP-ITMO-Course/blob/Task1/Task1_NLP_course.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 1. Course "Natural Language Analysis with Machine Learning Methods"

The dataset chosen for analysis is ["News Articles Classification Dataset for NLP & ML"](https://www.kaggle.com/datasets/banuprakashv/news-articles-classification-dataset-for-nlp-and-ml) from Kaggle.
It is a large collection of news articles spanning several domains: business, technology, sports, education, and entertainment.

On this dataset we will solve a multiclass classification task: assigning each news article to its domain. Model quality for such a task can be measured with an averaged F1 score.
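As a minimal sketch of the metric, assuming that "averaged" here means macro-averaged (the unweighted mean of the per-class F1 scores), the snippet below computes it with `sklearn.metrics.f1_score`. The labels are invented for illustration and are not taken from the dataset.

```python
from sklearn.metrics import f1_score

# Toy labels, invented purely for illustration (not taken from the dataset)
y_true = ["sports", "business", "sports", "technology", "business", "technology"]
y_pred = ["sports", "business", "business", "technology", "business", "sports"]

labels = ["business", "sports", "technology"]

# Per-class F1 scores, returned in the order given by `labels`
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)

# Macro average: the unweighted mean of the per-class scores
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")

print(dict(zip(labels, per_class_f1.round(3))))
print(round(macro_f1, 3))
```

When every class has the same number of examples, the macro and weighted averages coincide, so the choice between them matters mainly for imbalanced data.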

## Importing libraries and the dataset

In [None]:
import os
import pandas as pd
import plotly.express as px
import seaborn as sns

In [None]:
folder_path = '/content'

# Collect the names of all CSV files located directly in the folder
paths = [
    file for file in os.listdir(folder_path)
    if os.path.isfile(os.path.join(folder_path, file)) and file.endswith('.csv')
]

print(paths)

['sports_data.csv', 'business_data.csv', 'technology_data.csv', 'entertainment_data.csv', 'education_data.csv']


In [None]:
# Read every CSV and stack the results into a single DataFrame.
# Each file keeps its own index, so index values repeat across categories.
data = pd.concat([pd.read_csv(os.path.join(folder_path, path)) for path in paths])

data

Unnamed: 0,headlines,description,content,url,category
0,Jannik Sinner’s strong and improved baseline g...,Baseline strategy has been central to Jannik S...,"In a resolute, nerveless, three-hour-long disp...",https://indianexpress.com/article/sports/tenni...,sports
1,Allan Border opens up about Parkinson’s diseas...,Former Australia captain says the disease has ...,Legendary Australia cricketer Allan Border has...,https://indianexpress.com/article/sports/crick...,sports
2,Australian Open 2024: Jannik Sinner stops Nova...,"Sinner defeated Djokovic 6-1, 6-2, 6-7, 6-3 in...",Jannik Sinner has stunned Novak Djokovic to re...,https://indianexpress.com/article/sports/tenni...,sports
3,‘It looked like a dead lbw’: Ben Duckett on Sh...,Duckett believes England are still in the game...,England opener Ben Duckett says Shubman Gill s...,https://indianexpress.com/article/sports/crick...,sports
4,"In crunch time, Kiran George produces an aggre...",As Kiran George had missed out on opportunitie...,"Since the start of 2023, Kiran George had play...",https://indianexpress.com/article/sports/badmi...,sports
...,...,...,...,...,...
1995,"PSEB Class 10th Results 2023: When, where and ...",PSEB Class 10th Results 2023: The students wil...,PSEB Class 10th Results 2023: The Punjab Schoo...,https://indianexpress.com/article/education/ps...,education
1996,SAMS Odisha invites application for admission ...,The admission process will commence with accep...,"The Directorate of Higher Secondary Education,...",https://indianexpress.com/article/education/sa...,education
1997,"MPBSE MP Class 10th, 12th Supplementary exams ...","MPBSE MP Class 10, 12 Supplementary exams: Stu...","MPBSE MP Class 10, 12 Supplementary exams: The...",https://indianexpress.com/article/education/mp...,education
1998,KCET 2023: KEA releases provisional answer key...,KCET 2023: Candidates who appeared for the exa...,KCET 2023: The Karnataka Examination Authority...,https://indianexpress.com/article/education/kc...,education


For the rest of the work, two columns are enough: the article content and the label (the target we will predict).

In [None]:
data = data[['content', 'category']]

Let's compare the average content length across categories. As the chart shows, business and sports articles are considerably longer than those in the other categories.

In [None]:
category_avg_length = data.groupby('category')['content'].apply(lambda x: x.str.len().mean()).reset_index(name='avg_length')

fig = px.bar(category_avg_length, y='category', x='avg_length', orientation='h', labels={'avg_length': 'Average Content Length'})
fig.update_layout(showlegend=False)
fig.update_xaxes(title='Average Length')
fig.update_yaxes(title='Category')
fig.show()
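The comparison above measures length in characters. A small sketch of the same grouping, extended with a word count, is shown below; the `toy` frame and the column names `n_chars` and `n_words` are invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "content": [
        "Short note.",
        "A much longer business story about markets.",
        "Goal!",
        "Another long sports report here.",
    ],
    "category": ["business", "business", "sports", "sports"],
})

# Length in characters and in whitespace-separated words
lengths = toy.assign(
    n_chars=toy["content"].str.len(),
    n_words=toy["content"].str.split().str.len(),
)

# Mean of both measures per category
summary = lengths.groupby("category")[["n_chars", "n_words"]].mean()
print(summary)
```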

## Preprocessing

In [None]:
!python -m spacy download en_core_web_sm > /dev/null

In [None]:
import re
import nltk
import spacy
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from nltk.corpus import stopwords
# from nltk.stem import WordNetLemmatizer
from string import punctuation

from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
# nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
data.sample(5)

Unnamed: 0,content,category
1937,All eyes are on India vs Australia World Cup 2...,entertainment
1491,Pro tennis player Jessica Pegula has revealed ...,sports
1131,"In the world of films, neither tears nor eyela...",entertainment
626,The rupee depreciated 9 paise to close at an a...,business
845,Australian opener David Warner will miss the r...,sports


In [None]:
X_train, X_test, y_train, y_test = train_test_split(data['content'], data['category'],
                                                    test_size=0.2,
                                                    random_state=2024, stratify=data['category'])
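The `stratify` argument asks `train_test_split` to keep the class proportions the same in both parts. A minimal sketch on an invented, perfectly balanced toy frame (the variable names here are hypothetical) shows the effect:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy balanced data: 5 categories, 10 rows each (invented for illustration)
toy = pd.DataFrame({
    "content": [f"text {i}" for i in range(50)],
    "category": ["business", "education", "entertainment", "sports", "technology"] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["content"], toy["category"],
    test_size=0.2, random_state=2024, stratify=toy["category"],
)

# With stratification, every category appears equally often in each part
print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```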

In [None]:
def custom_preprocessor(text):
    # Remove everything except word characters, whitespace and the ASCII apostrophe
    text = re.sub(r'[^\w\s\']', '', text)
    # Normalize case
    text = text.lower()
    return text
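To make the effect of this preprocessor concrete, here is a small sketch that applies the same function to two invented strings. It shows two side effects worth knowing about: removed hyphens glue digits together, and the typographic apostrophe (’) is stripped while the ASCII one is kept.

```python
import re

def custom_preprocessor(text):
    # Same logic as in the notebook: strip punctuation except ', then lowercase
    text = re.sub(r'[^\w\s\']', '', text)
    return text.lower()

# Invented sample strings, used only for illustration
sample = "Sinner's 6-1, 6-2 win: 'stunning'!"
cleaned = custom_preprocessor(sample)
print(cleaned)

# A typographic apostrophe is not matched by \w, \s or ', so it is removed
curly = custom_preprocessor("Parkinson’s")
print(curly)
```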

In [None]:
vectorizer = CountVectorizer(max_df=0.95, min_df=0.1,  # drop terms found in >95% or <10% of documents
                             lowercase=False,  # lowercasing is handled by custom_preprocessor
                             token_pattern=r"(?u)\b\w\w+\b",
                             preprocessor=custom_preprocessor,
                             stop_words=stopwords.words('english'))

In [None]:
X_train_countvectorized = vectorizer.fit_transform(X_train)
X_test_countvectorized = vectorizer.transform(X_test)

In [None]:
vectorizer.get_feature_names_out()

array(['10', '2023', '2024', 'according', 'account', 'added',
       'advertisement', 'along', 'already', 'also', 'announced', 'around',
       'article', 'back', 'bollywood', 'buy', 'cent', 'click', 'company',
       'continue', 'could', 'day', 'email', 'entertainment', 'even',
       'exhausted', 'express', 'facebook', 'film', 'first', 'free', 'get',
       'google', 'headlines', 'however', 'including', 'india', 'indian',
       'last', 'latest', 'like', 'limit', 'made', 'make', 'many',
       'monthly', 'national', 'need', 'new', 'news', 'next', 'official',
       'one', 'per', 'premium', 'read', 'reading', 'register', 'rs',
       'said', 'second', 'set', 'sign', 'since', 'stories', 'story',
       'subscriber', 'subscription', 'take', 'team', 'test', 'three',
       'time', 'top', 'two', 'updates', 'us', 'well', 'world', 'would',
       'year', 'years'], dtype=object)

In [None]:
pd.DataFrame(X_train_countvectorized.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,10,2023,2024,according,account,added,advertisement,along,already,also,...,time,top,two,updates,us,well,world,would,year,years
0,2,0,0,0,0,0,3,2,0,4,...,2,2,2,0,0,5,0,1,0,1
1,0,0,0,0,0,1,3,0,1,0,...,0,4,0,0,1,0,3,1,0,1
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
3,0,2,0,0,0,0,5,0,0,1,...,0,0,0,0,0,0,0,0,3,0
4,0,1,0,0,1,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7996,0,0,0,0,1,0,0,2,1,1,...,0,1,0,2,0,0,1,0,0,0
7997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7998,0,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's see how a constant model and a baseline model perform.

In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [None]:
dummy_clf = DummyClassifier(strategy="most_frequent").fit(X_train_countvectorized, y_train)
preds = dummy_clf.predict(X_test_countvectorized)
print(classification_report(y_test, preds, zero_division=0))

               precision    recall  f1-score   support

     business       0.20      1.00      0.33       400
    education       0.00      0.00      0.00       400
entertainment       0.00      0.00      0.00       400
       sports       0.00      0.00      0.00       400
   technology       0.00      0.00      0.00       400

     accuracy                           0.20      2000
    macro avg       0.04      0.20      0.07      2000
 weighted avg       0.04      0.20      0.07      2000



The constant model predicts "business" for every document. Recall for "business" is therefore 1.0: every document of this category is labeled correctly. Precision for this category, however, is only 0.2, because only 20% of the documents labeled "business" actually belong to it.
For all other categories precision, recall and F1-score are 0.0, since the model never predicts them.
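The numbers in the report can be reproduced by hand. The sketch below redoes the arithmetic for a constant prediction on a balanced test set, using the class count and support values from the report above.

```python
# Balanced test set from the report above: 5 classes, 400 documents each
n_classes = 5
support = 400
total = n_classes * support

# The constant model assigns the same class to every document
true_positives = support       # all documents of that class are labeled correctly
predicted_positive = total     # every document receives that label

precision = true_positives / predicted_positive
recall = true_positives / support
f1 = 2 * precision * recall / (precision + recall)

# The four remaining classes contribute an F1 of 0 to the macro average
macro_f1 = f1 / n_classes

print(round(precision, 2), round(recall, 2), round(f1, 2), round(macro_f1, 2))
```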

In [None]:
logreg = LogisticRegression().fit(X_train_countvectorized, y_train)
preds = logreg.predict(X_test_countvectorized)
print(classification_report(y_test, preds))

               precision    recall  f1-score   support

     business       0.86      0.83      0.85       400
    education       0.82      0.79      0.80       400
entertainment       0.96      0.87      0.91       400
       sports       0.79      0.90      0.84       400
   technology       0.78      0.80      0.79       400

     accuracy                           0.84      2000
    macro avg       0.84      0.84      0.84      2000
 weighted avg       0.84      0.84      0.84      2000




lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



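Logistic regression classifies the news articles well across categories: it clearly beats the constant baseline (accuracy 0.20) and reaches an overall accuracy of 0.84. Note that the solver stopped at its iteration limit before converging, so these numbers come from a model that is not fully optimized.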
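The warning above suggests raising `max_iter`. A minimal sketch of that option on an invented toy corpus is shown below; the value 1000 is an assumption chosen for illustration, not a setting used in the notebook, and the results reported above were obtained with the defaults.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus and labels, invented for illustration
texts = [
    "the team won a close match",
    "the team scored a late goal",
    "the film opens in theatres",
    "the actor signed a new film",
    "the company reported profit",
    "shares of the company fell",
]
labels = ["sports", "sports", "entertainment", "entertainment", "business", "business"]

X = CountVectorizer().fit_transform(texts)

# Turn a convergence warning into an error so that a failure cannot go unnoticed
with warnings.catch_warnings():
    warnings.simplefilter("error", category=ConvergenceWarning)
    clf = LogisticRegression(max_iter=1000).fit(X, labels)

# n_iter_ holds the number of iterations the solver actually used
print(clf.n_iter_)
```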

In [None]:
!pip install eli5 > /dev/null

In [None]:
import eli5

In [None]:
eli5.show_weights(
    estimator=logreg,
    feature_names=list(vectorizer.get_feature_names_out()),
    top=(10, 5)
)

y=business top features
Weight?,Feature
+1.651,premium
+1.319,advertisement
+1.109,subscriber
+1.092,story
+0.913,cent
+0.898,limit
+0.855,email
+0.796,company
+0.763,monthly
+0.676,exhausted

y=education top features
Weight?,Feature
+1.275,read
+1.039,register
+0.997,test
+0.972,national
+0.968,reading
+0.936,subscriber
+0.827,official
+0.685,per
+0.642,2024
+0.605,10

y=entertainment top features
Weight?,Feature
+3.516,film
+1.878,bollywood
+1.702,headlines
+1.235,entertainment
+1.140,story
+1.051,register
+1.040,read
+0.994,latest
+0.756,premium
+0.754,monthly

y=sports top features
Weight?,Feature
+2.077,click
+1.863,exhausted
+1.440,subscription
+1.157,team
+1.044,test
+1.041,sign
+1.040,buy
+0.998,<BIAS>
+0.860,article
+0.749,world

y=technology top features
Weight?,Feature
+2.929,premium
+1.704,monthly
+1.377,google
+1.071,company
+0.813,subscriber
+0.758,latest
+0.738,according
+0.643,free
+0.606,updates
+0.592,could
