---
title: "Feature Selection for Text Data"
format:
  html:
    page-layout: full
    code-fold: show
    code-copy: true
    code-tools: true
    code-overflow: wrap
---

In this section, we apply a Naive Bayes classifier to text data. The dataset comes from NewsAPI and covers a range of topics; we add a 'category' column to serve as the class label. We restrict the analysis to the first 10 documents and evaluate how accurately the model classifies them.

We start from a cleaned version of the text, so that inconsistencies and unwanted characters do not skew the results. We then split the dataset into training and test sets: the model is trained on one subset and validated on the other.

The goal is a Naive Bayes model that classifies the news articles based on their content.

In [19]:
import numpy as np 
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import scipy
import sklearn 

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn import metrics

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, accuracy_score

In [10]:
text_df = pd.read_csv("cleaned_data/medicine-cleaned.csv")

text_df.head()  # preview the first five rows; the analysis below uses the first 10

Unnamed: 0,title,description
0,a round-up of the talks from gaconf usa 2023,gaconf returned to the us this week with a ser...
1,american can prevent (and control) type 2 diab...,usa today's health team spoke with scores of e...
2,the steep cost of type 2: when diabetes dragge...,the nation's disjointed and confusing health c...
3,a hidden system of exploitation underpins us h...,this series was produced in partnership with t...
4,the childcare cliff: $122 billion dollar crisi...,the childcare crisis in america; the harsh tru...


First, we keep only the first 10 rows of the dataset and assign a category label to each one.

In [15]:
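text_df = text_df.iloc[:10].copy()  # keep the first 10 rows; copy so a column can be added safely
len(text_df)

# Manually assigned topic label for each of the 10 documents
text_df['category'] = [0, 1, 1, 2, 2, 3, 3, 4, 4, 5]
text_df.head()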

Unnamed: 0,title,description,category
0,a round-up of the talks from gaconf usa 2023,gaconf returned to the us this week with a ser...,0
1,american can prevent (and control) type 2 diab...,usa today's health team spoke with scores of e...,1
2,the steep cost of type 2: when diabetes dragge...,the nation's disjointed and confusing health c...,1
3,a hidden system of exploitation underpins us h...,this series was produced in partnership with t...,2
4,the childcare cliff: $122 billion dollar crisi...,the childcare crisis in america; the harsh tru...,2


## Train and Test Dataset

Next, we split the dataset into training and test subsets before applying Naive Bayes to the text.

The model is fit on the training set and evaluated on the test set, which it never sees during training. This shows how well the model generalizes to unseen data. With `test_size=0.2` and 10 documents, the test set contains only 2 documents, so the resulting accuracy is a rough indication rather than a reliable estimate.

In [18]:
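# Hold out 20% of the documents (2 of 10) for testing
X_train, X_test, y_train, y_test = train_test_split(text_df[['title', 'description']], text_df['category'], test_size=0.2, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()  # explicit copies, safe to add columns

# Combine title and description into one text field per document
X_train['text'] = X_train['title'] + ' ' + X_train['description']
X_test['text'] = X_test['title'] + ' ' + X_test['description']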
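
To see what such a split leaves us with, the sketch below repeats it on the category labels alone and counts the classes on each side. The `doc_ids` list is a hypothetical stand-in for the 10 articles; only the labels, `test_size`, and `random_state` are taken from the steps above.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Same labels as the 'category' column above.
labels = [0, 1, 1, 2, 2, 3, 3, 4, 4, 5]
doc_ids = list(range(len(labels)))  # hypothetical stand-ins for the 10 articles

ids_train, ids_test, y_tr, y_te = train_test_split(
    doc_ids, labels, test_size=0.2, random_state=42
)

print("train size:", len(ids_train), "test size:", len(ids_test))
print("train class counts:", dict(Counter(y_tr)))
print("test class counts:", dict(Counter(y_te)))

# A class that occurs in the test set but not in the training set
# can never be predicted correctly, whatever the classifier.
unseen = set(y_te) - set(y_tr)
print("test classes unseen in training:", unseen)
```

With so few documents per category, a check like this is a quick way to spot classes that are barely represented in the training set.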

## Vectorization

Machine learning algorithms need numerical input, so the text must first be converted into numbers, a process known as text vectorization. Here we use `CountVectorizer`, which represents each document as a vector of word counts. The vocabulary is learned from the training set only (`fit_transform`) and then reused on the test set (`transform`), so words that appear only in the test documents are ignored. The resulting count matrix is the input that Naive Bayes uses to find patterns and make predictions.

In [20]:
vectorizer = CountVectorizer()
X_train_vect = vectorizer.fit_transform(X_train['text'])
X_test_vect = vectorizer.transform(X_test['text'])
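
As a small illustration of what the vectorizer produces, the sketch below runs `CountVectorizer` on a toy corpus. The sentences are made up for this example and are not taken from the NewsAPI data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the NewsAPI data.
train_docs = ["diabetes care costs rise", "childcare costs rise again"]
test_docs = ["diabetes costs and insurance"]

vec = CountVectorizer()
train_counts = vec.fit_transform(train_docs)  # learns the vocabulary from training text
test_counts = vec.transform(test_docs)        # reuses that vocabulary, no refit

# Vocabulary terms ordered by their column index
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(terms)
# ['again', 'care', 'childcare', 'costs', 'diabetes', 'rise']

print(train_counts.toarray())
# [[0 1 0 1 1 1]
#  [1 0 1 1 0 1]]

print(test_counts.toarray())
# [[0 0 0 1 1 0]]  ('and' and 'insurance' are not in the vocabulary)
```

Each column corresponds to one vocabulary term, and each row holds the word counts for one document.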

## Model Training

With the text vectorized, we move on to model training. We fit a Naive Bayes classifier on the vectorized training data. This probabilistic model, widely used for text classification, learns how strongly each word is associated with each category in the training set. It then uses these associations to predict the category of unseen documents.

In [28]:
classify = MultinomialNB()
classify.fit(X_train_vect, y_train)
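
To make "learning the associations" concrete, the sketch below fits `MultinomialNB` on a toy corpus and prints the parameters it estimates: the class priors and the smoothed per-class word probabilities. The two documents and their labels are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical two-document corpus with made-up labels (not the NewsAPI data).
toy_docs = ["insulin insulin diabetes", "childcare cost cost"]
toy_labels = [1, 2]

toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(toy_docs)

toy_clf = MultinomialNB()  # default alpha=1.0 applies Laplace smoothing
toy_clf.fit(toy_counts, toy_labels)

vocab = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
print(vocab)  # ['childcare', 'cost', 'diabetes', 'insulin']

# Class priors P(c): each class has one of the two documents, so 0.5 each
print(np.exp(toy_clf.class_log_prior_))

# Smoothed P(word | c), one row per class: (count + 1) / (total words in class + vocabulary size)
# class 1 -> [1/7, 1/7, 2/7, 3/7], class 2 -> [2/7, 3/7, 1/7, 1/7]
print(np.exp(toy_clf.feature_log_prob_))

# A new document is scored with the learned priors and word probabilities.
print(toy_clf.predict(toy_vec.transform(["insulin cost cost"])))  # [2]
```

Because of the smoothing, a word never seen in a class still gets a small non-zero probability, so a single unfamiliar word does not rule that class out.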

## Prediction

With the model trained, we use the `MultinomialNB()` classifier, the Naive Bayes variant designed for multinomially distributed data such as word counts, to predict categories for the test set. Finally, we compute the accuracy and print a classification report from the predicted values.

In [29]:
y_predict = classify.predict(X_test_vect)

In [31]:
print('Accuracy:', accuracy_score(y_test, y_predict))

print('Classification Report:\n', classification_report(y_test, y_predict))

Accuracy: 0.5
Classification Report:
               precision    recall  f1-score   support

           1       1.00      1.00      1.00         1
           2       0.00      0.00      0.00         0
           4       0.00      0.00      0.00         1

    accuracy                           0.50         2
   macro avg       0.33      0.33      0.33         2
weighted avg       0.50      0.50      0.50         2
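
The zero-valued rows in this report arise because some metrics are undefined: category 2 was predicted but never occurs in the test set, and category 4 occurs but was never predicted. The sketch below reproduces that situation with illustrative values inferred from the report (the actual contents of `y_test` and `y_predict` are not shown above) and uses the `zero_division` parameter of `classification_report` to state explicitly how undefined metrics are reported.

```python
from sklearn.metrics import accuracy_score, classification_report

# Illustrative values consistent with the report above (an assumption,
# since the actual y_test and y_predict are not printed).
y_true = [4, 1]
y_pred = [2, 1]

acc = accuracy_score(y_true, y_pred)
print("Accuracy:", acc)  # Accuracy: 0.5

# zero_division=0 reports undefined precision/recall as 0 without a warning
print(classification_report(y_true, y_pred, zero_division=0))

# The same report as a dictionary, convenient for programmatic checks
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(report["2"]["support"])  # number of true samples in category 2
```

With only two test documents, a single misclassification changes the accuracy by 0.5, which is why these numbers should be read with caution.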





## Conclusion

In conclusion, applying a Naive Bayes classifier to text data in Python involves three main stages: data preprocessing, which cleans the text for the algorithm; text vectorization, which converts the text into a machine-readable numerical format; and model training, where the classifier is fit on the processed data. The final step is evaluating the model's performance. Here the accuracy is 0.5 on a test set of only two documents, which is lower than anticipated. This accuracy can potentially be improved by adjusting the categories and experimenting with different text datasets, since the model's effectiveness depends on the nature and quality of the data it is given.