# Homework 7: Text Classifier

This homework is more self-directed than previous assignments.

The goal is simple:

* Using the data from the `pubmed_sample_train.csv` file, train a classifier of your choosing for the `target` variable, which indicates whether an article is about a particular topic
* Make a prediction of the target in `pubmed_sample_test.csv` and save it to a file
* Submit your code, requirements, and notebook (as usual), plus a copy of `pubmed_sample_test.csv` that includes a column with your predictions

In addition to reviewing your code, I will provide you with a score of how accurate your predictions were.

Notes:

* For grading: I am not going to grade based on your model score, but rather on whether you have a complete, working example and valid predictions that are hopefully better than random!
* Please follow some best practices (such as handling train/test/validation data) for training a classification model and make some effort to build a solid model. Don't forget about feature selection if using sparse features from bag-of-words.
* To be sure that your code works, you may want to create your own 'test' data file from the training data, so you can check that your prediction method runs end to end.
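The last note above can be sketched roughly as follows. This is only an illustration: `make_practice_test` is a hypothetical helper name, and the toy rows stand in for `pubmed_sample_train.csv`. The idea is to hold out some labelled rows, drop their labels to mimic the real test file, and keep the answers aside for scoring.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def make_practice_test(df, target_col="target", test_size=0.2, seed=0):
    """Split a labelled frame into a training part and an unlabelled practice test part."""
    train_part, holdout = train_test_split(
        df, test_size=test_size, random_state=seed, stratify=df[target_col]
    )
    practice_test = holdout.drop(columns=[target_col])  # looks like the real test file
    answers = holdout[target_col]                       # kept aside for scoring
    return train_part, practice_test, answers


# Toy stand-in for the training file (made-up rows, not real data)
toy = pd.DataFrame({
    "title": [f"title {i}" for i in range(20)],
    "abstract": [f"abstract {i}" for i in range(20)],
    "target": [True] * 4 + [False] * 16,
})

train_part, practice_test, answers = make_practice_test(toy)
print(train_part.shape, practice_test.shape)
```

Running the full prediction code against `practice_test` and comparing with `answers` gives a dry run of the submission before touching the real test file.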


Some libraries to consider:

* As we've already seen in class, scikit-learn contains many classifiers to choose from
* scikit-learn contains a number of ways to do the text featurization we talked about in class, including [CountVectorizer](https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer). Be aware that these follow the same `.fit()` and `.transform()` pattern as a model: fit the vectorizer on your training data only, then use that same fitted vectorizer to transform your test data.
* If you would like to try stemming, [NLTK](https://www.nltk.org/api/nltk.html) has some options
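As a minimal sketch of the fit-on-train, transform-on-test pattern described above, combined with the feature selection mentioned in the notes: the texts and labels below are made up, and `k=5` is an arbitrary choice for illustration. `SelectKBest` with `chi2` is one possible selector for sparse bag-of-words features, not the only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Made-up toy corpus
train_texts = [
    "gene expression in tumor cells",
    "tumor growth and gene mutation",
    "stroke risk in elderly patients",
    "exercise therapy for stroke patients",
]
train_labels = [True, True, False, False]
test_texts = ["gene mutation and stroke risk", "a completely unseen sentence"]

# Learn the vocabulary and IDF weights from the training texts only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Reuse the fitted vectorizer: words unseen in training are simply ignored
X_test = vectorizer.transform(test_texts)

# Keep the k features most associated with the label (chi-squared score)
selector = SelectKBest(chi2, k=5)
X_train_sel = selector.fit_transform(X_train, train_labels)
X_test_sel = selector.transform(X_test)

print(X_train.shape, X_test.shape, X_test_sel.shape)
```

Because both the vectorizer and the selector are fit on the training data alone, the train and test matrices end up with identical columns.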



In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import nltk

In [3]:
# Ensure necessary NLTK resources are available
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/yashwanth/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
train_df = pd.read_csv('pubmed_sample_train.csv')
test_df = pd.read_csv('pubmed_sample_test.csv')

### Don't overlook the class imbalance. Like most text classification tasks, there is a large class imbalance here.


In [6]:
train_df.target.value_counts()

target
False    14729
True       337
Name: count, dtype: int64
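To make the imbalance concrete, here is a short calculation using only the two counts printed above. It shows the accuracy a trivial "always predict False" model would reach, which is why accuracy alone is a weak yardstick for this task.

```python
# Counts copied from the value_counts() output above
n_false, n_true = 14729, 337
total = n_false + n_true

# Accuracy of a model that always predicts the majority class
majority_accuracy = n_false / total

# Share of positive examples in the training data
positive_rate = n_true / total

print(f"majority-class accuracy: {majority_accuracy:.4f}")
print(f"positive rate:           {positive_rate:.4f}")
```

Any model has to beat roughly 97.8% accuracy just to match that baseline, so precision and recall on the `True` class are the more informative numbers.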

In [7]:
train_df

| Unnamed: 0 | title | abstract | journal | pubdate | target | article_number |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | The role of the foam formulation in improving ... | BACKGROUND\nTopical therapies represent the fi... | European review for medical and pharmacologica... | 2023 | False | 3866 |
| 1 | A nigroincertal projection mediates aversion a... | Recent studies have shown that the non-DA neur... | FASEB journal : official publication of the Fe... | 2023 | False | 6406 |
| 2 | Ion binding with charge inversion combined wit... | Membraneless organelles, or biomolecular conde... | Cell reports | 2023 | False | 5703 |
| 3 | Effects of pulmonary-based Qigong exercise in ... | BACKGROUND\nPhysical exercise training is the ... | BMC complementary medicine and therapies | 2023 | False | 6904 |
| 4 | Incidence and risk of stroke in Korean patient... | OBJECTIVES\nThe incidence and risk of ischemic... | Journal of stroke and cerebrovascular diseases... | 2023 | False | 5749 |
| ... | ... | ... | ... | ... | ... | ... |
| 15061 | miR-124 as a Liquid Biopsy Prognostic Biomarke... | Despite advances in non-small cell lung cancer... | International journal of molecular sciences | 2023 | False | 11285 |
| 15062 | Insights into the Effect of Chitosan and β-Cyc... | Synthetic zeolite-A (ZA) was hybridized with t... | Molecules (Basel, Switzerland) | 2023 | False | 11965 |
| 15063 | Production of the C/TiO2 composite with a high... | Resource recycling from waste-water and sludge... | Journal of environmental sciences (China) | 2024 | False | 5391 |
| 15064 | Descriptive regression tree analysis of inters... | BACKGROUND\nWhile self-rated health (SRH) is a... | PloS one | 2023 | False | 861 |


In [8]:
# Combine title and abstract for better context
def preprocess_text(df):
    df['text'] = df['title'].fillna('') + ' ' + df['abstract'].fillna('')
    return df

train_df = preprocess_text(train_df)
test_df = preprocess_text(test_df)

In [9]:
# Apply stemming 
stemmer = PorterStemmer()

def apply_stemming(text):
    tokens = word_tokenize(text)
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return ' '.join(stemmed_tokens)

train_df['text'] = train_df['text'].apply(apply_stemming)
test_df['text'] = test_df['text'].apply(apply_stemming)
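One thing that may be worth checking when stemming comes before vectorizing: a stop-word list contains unstemmed words, so a stemmed token might no longer match it. The sketch below is only a diagnostic on a handful of hand-picked words; it uses scikit-learn's built-in `ENGLISH_STOP_WORDS` set and prints whether each stemmed form is still in that set.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

# A few common English stop words, chosen by hand for illustration
sample_stop_words = ["this", "was", "because", "very", "the"]

for word in sample_stop_words:
    stemmed = stemmer.stem(word)
    still_filtered = stemmed in ENGLISH_STOP_WORDS
    print(f"{word!r} -> {stemmed!r} | still a stop word: {still_filtered}")
```

If some stemmed forms fall outside the list, one option is to remove stop words before stemming; whether that changes the model noticeably is something to verify on validation data.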

In [10]:
# Vectorize text using TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=10000, stop_words='english')
X_train_full = vectorizer.fit_transform(train_df['text'])
X_test = vectorizer.transform(test_df['text'])

In [11]:
# Extract target variable
y_train_full = train_df['target']

# Split into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.2, random_state=42)
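Two possible refinements to the split above, shown as a sketch on made-up toy texts rather than as the notebook's actual method. First, `stratify=` keeps the rare `True` class represented in both parts. Second, wrapping the vectorizer and classifier in a `Pipeline` and splitting the raw text means the TF-IDF vocabulary is learned from the training part only, so the validation score is not influenced by validation documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up toy corpus: 4 positive and 8 negative documents
texts = [
    "gene therapy trial for tumor", "tumor gene expression study",
    "gene mutation in tumor cells", "tumor suppressor gene analysis",
    "stroke risk in older adults", "exercise and lung function",
    "zeolite composite for water treatment", "self rated health survey",
    "topical foam for skin disease", "membrane organelles and ion binding",
    "sludge recycling with titanium oxide", "qigong exercise for breathing",
]
labels = [True] * 4 + [False] * 8

# Split the raw text first; stratify keeps the class ratio in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training part only, then the classifier
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

val_pred = pipe.predict(X_va)
print(len(X_tr), len(X_va), sum(y_va))
```

The same fitted `pipe` can then be applied directly to the raw test text with `pipe.predict(...)`.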


In [12]:
# Train a classifier
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)
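Given the class imbalance noted earlier, one option worth trying is `class_weight="balanced"`, which reweights the loss so that the rare class counts for more. The sketch below compares recall with and without it on synthetic data from `make_classification` (an assumption for illustration, with about 5% positives); results on the PubMed data may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real features
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

recalls = {}
for name, class_weight in [("default", None), ("balanced", "balanced")]:
    clf = LogisticRegression(max_iter=1000, class_weight=class_weight, random_state=42)
    clf.fit(X_tr, y_tr)
    # Recall on the positive (rare) class
    recalls[name] = recall_score(y_va, clf.predict(X_va))

print(recalls)
```

Balanced weights usually trade some precision for recall, so both numbers should be compared on the validation set before choosing.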

In [13]:
# Evaluate on validation set
y_val_pred = model.predict(X_val)
print(classification_report(y_val, y_val_pred))

              precision    recall  f1-score   support

       False       0.99      1.00      0.99      2955
        True       1.00      0.27      0.43        59

    accuracy                           0.99      3014
   macro avg       0.99      0.64      0.71      3014
weighted avg       0.99      0.99      0.98      3014
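The report above shows perfect precision but a recall of only 0.27 on the `True` class, so the model misses most positive articles. One way to explore that trade-off is to lower the decision threshold on `predict_proba` instead of using the default 0.5. The sketch below uses synthetic data, and `predict_with_threshold` is a hypothetical helper; the threshold 0.2 is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split


def predict_with_threshold(model, X, threshold=0.5):
    """Label a row positive when its predicted probability reaches the threshold."""
    proba_positive = model.predict_proba(X)[:, 1]  # column 1 = model.classes_[1]
    return np.where(proba_positive >= threshold, model.classes_[1], model.classes_[0])


# Synthetic imbalanced data standing in for the real features
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

results = {}
for threshold in (0.5, 0.2):
    pred = predict_with_threshold(model, X_va, threshold)
    results[threshold] = (
        precision_score(y_va, pred, zero_division=0),
        recall_score(y_va, pred),
    )
    print(threshold, results[threshold])
```

Lowering the threshold can only keep or increase recall, usually at some cost in precision; the threshold should be picked on validation data, not on the test set.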



In [14]:
# Predict on the test set
test_df['predictions'] = model.predict(X_test)

In [15]:
# Save the modified test dataset
test_df.to_csv('pubmed_sample_test_with_predictions.csv', index=False)
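Before submitting, a quick read-back of the saved file can catch missing columns or dropped rows. This is a minimal sketch: `check_predictions_file` is a hypothetical helper, and the demo writes a tiny made-up frame to a temporary directory rather than touching the real output file.

```python
import os
import tempfile

import pandas as pd


def check_predictions_file(path, expected_rows, column="predictions"):
    """Reload a saved predictions file and run basic sanity checks."""
    saved = pd.read_csv(path)
    assert column in saved.columns, f"missing column: {column}"
    assert len(saved) == expected_rows, "row count changed when saving"
    assert saved[column].notna().all(), "some rows have no prediction"
    return saved[column].value_counts()


# Demo on a made-up frame written to a temporary file
demo = pd.DataFrame({
    "title": ["a", "b", "c"],
    "predictions": [False, True, False],
})
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_predictions.csv")
    demo.to_csv(demo_path, index=False)
    counts = check_predictions_file(demo_path, expected_rows=len(demo))

print(counts)
```

For the real file, the call would be `check_predictions_file('pubmed_sample_test_with_predictions.csv', len(test_df))`; a prediction count with no `True` values at all would be a hint to revisit the model.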