Text Classification for Health-related Messages:
Objective: Classify health-related text messages (e.g., SMS or forum posts) into categories such as symptoms, treatment, or advice.
Techniques: Text Preprocessing, TF-IDF, Naive Bayes, Logistic Regression.
Tools: Python, NLTK, Scikit-Learn, Pandas.
Dataset: Health-related text datasets from public repositories or custom datasets.

General Workflow for Each Project:
Data Collection: Obtain the necessary text data from public datasets or through web scraping.
Data Preprocessing: Clean and preprocess the text data, including tokenization, stopword removal, and stemming/lemmatization.
Feature Extraction: Convert text data into numerical representations using techniques like Bag of Words, TF-IDF, or word embeddings.
Model Development: Train machine learning models to achieve the project's objective.
Model Evaluation: Evaluate the performance of the model using metrics like accuracy, precision, recall, and F1-score.
Optimization: Tune hyperparameters to improve model performance.
Documentation: Document the process, results, and insights gained from the project.
API: Pickle the trained model and build a user-testing API with any web framework for demonstration.
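
The Optimization step is not exercised in the cells that follow, so here is a minimal sketch of how it could look with a grid search over a TF-IDF + Naive Bayes pipeline. The toy corpus, the parameter grid, and the `f1_macro` scoring choice are all assumptions made for illustration; they are not taken from the project data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: four short documents per class.
toy_texts = [
    "thyroid gland nodule surgery", "thyroid hormone gland",
    "thyroid nodule biopsy", "papillary thyroid carcinoma",
    "colon polyp colonoscopy", "colon tumor resection",
    "colorectal colon screening", "colon mucosa polyp",
    "lung nodule smoking", "lung tumor bronchus",
    "lung adenocarcinoma smoking", "pulmonary lung lesion",
]
toy_labels = [0] * 4 + [1] * 4 + [2] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Step names prefix the parameter names: "<step>__<parameter>".
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 1.0],
}

# cv=2 only because the toy corpus is tiny; real data would use more folds.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(toy_texts, toy_labels)
print(search.best_params_, round(search.best_score_, 2))
```

On the real data the same pattern would be fitted on the training split only, keeping the test split for the final evaluation.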

In [1]:
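import pandas as pd
# The path is specific to the author's machine. encoding='latin1' never raises a
# decode error, but it garbles UTF-8 multi-byte characters (see "ï¬" in row 2 below).
df = pd.read_csv(r'C:\Users\aditya.devdhe\Downloads\medicalData.csv', encoding='latin1')
df.head()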

Unnamed: 0.1,Unnamed: 0,0,a
0,0,Thyroid_Cancer,Thyroid surgery in children in a single insti...
1,1,Thyroid_Cancer,""" The adopted strategy was the same as that us..."
2,2,Thyroid_Cancer,coronary arterybypass grafting thrombosis ï¬b...
3,3,Thyroid_Cancer,Solitary plasmacytoma SP of the skull is an u...
4,4,Thyroid_Cancer,This study aimed to investigate serum matrix ...
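
The "ï¬" in row 2 looks like the "ﬁ" ligature after its UTF-8 bytes were decoded as Latin-1. Assuming that is what happened, a string can be repaired by reversing the decode and then folding the ligature to plain letters. The helper name `repair_mojibake` is hypothetical, and the sketch falls back to the original string when the round trip fails.

```python
import unicodedata

def repair_mojibake(text: str) -> str:
    """Undo 'UTF-8 bytes read as Latin-1', then fold ligatures such as U+FB01."""
    try:
        fixed = text.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not a clean Latin-1/UTF-8 mix-up; leave the text unchanged.
        fixed = text
    return unicodedata.normalize("NFKC", fixed)

# "ï", "¬" and the invisible control character U+0081 are the three
# Latin-1 characters produced by the UTF-8 bytes EF AC 81.
sample = "thrombosis \u00ef\u00ac\u0081b"
print(repair_mojibake(sample))
```

It could be applied with `df['Text'].apply(repair_mojibake)` before any other preprocessing; whether every row of the file is valid UTF-8 is not something the preview alone can confirm.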


In [2]:
df.shape

(7570, 3)

In [3]:
# Give the label and text columns descriptive names.
df.rename(columns={"0": "Category", "a": "Text"}, inplace=True)
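
The "Unnamed" column in the preview is a leftover index written by an earlier `to_csv` call and carries no information for the classifier. A small sketch of dropping such columns, shown on a hypothetical toy frame rather than the real data:

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Category": ["Thyroid_Cancer", "Colon_Cancer"],
    "Text": ["first abstract", "second abstract"],
})

# Collect every leftover index column by its "Unnamed" prefix and drop it.
unnamed = [col for col in toy.columns if col.startswith("Unnamed")]
toy = toy.drop(columns=unnamed)
print(list(toy.columns))
```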

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,Category,Text
0,0,Thyroid_Cancer,Thyroid surgery in children in a single insti...
1,1,Thyroid_Cancer,""" The adopted strategy was the same as that us..."
2,2,Thyroid_Cancer,coronary arterybypass grafting thrombosis ï¬b...
3,3,Thyroid_Cancer,Solitary plasmacytoma SP of the skull is an u...
4,4,Thyroid_Cancer,This study aimed to investigate serum matrix ...


In [5]:
df.Category.value_counts()

Category
Thyroid_Cancer    2810
Colon_Cancer      2580
Lung_Cancer       2180
Name: count, dtype: int64

In [6]:
# Encode the three cancer types as integer labels for scikit-learn.
df['category_num'] = df['Category'].map({'Thyroid_Cancer': 0, 'Colon_Cancer': 1, 'Lung_Cancer': 2})

In [7]:
df.head()

Unnamed: 0.1,Unnamed: 0,Category,Text,category_num
0,0,Thyroid_Cancer,Thyroid surgery in children in a single insti...,0
1,1,Thyroid_Cancer,""" The adopted strategy was the same as that us...",0
2,2,Thyroid_Cancer,coronary arterybypass grafting thrombosis ï¬b...,0
3,3,Thyroid_Cancer,Solitary plasmacytoma SP of the skull is an u...,0
4,4,Thyroid_Cancer,This study aimed to investigate serum matrix ...,0


In [9]:
import spacy

In [10]:
nlp=spacy.load("en_core_web_sm")

In [11]:
def count_words(text):
    return len(text.split())

In [12]:
df['word_count'] = df['Text'].apply(count_words)
df['word_count']

0       2871
1       2494
2       2954
3       1880
4       3037
        ... 
7565    1429
7566    1252
7567    4510
7568    4051
7569    4385
Name: word_count, Length: 7570, dtype: int64

In [13]:
def limit_words(text, max_words):
    words = text.split()
    return ' '.join(words[:max_words])

max_words = 500
df['Text'] = df['Text'].apply(lambda x: limit_words(x, max_words))

In [14]:
df['word_count'] = df['Text'].apply(count_words)
df['word_count']

0       500
1       500
2       500
3       500
4       500
       ... 
7565    500
7566    500
7567    500
7568    500
7569    500
Name: word_count, Length: 7570, dtype: int64

In [15]:
def preprocess(text):
    """Lowercase the text, drop stop words and punctuation, and lemmatize the rest."""
    doc = nlp(text.lower())
    filtered_tokens = [
        token.lemma_
        for token in doc
        if not (token.is_stop or token.is_punct)
    ]
    return " ".join(filtered_tokens)

In [16]:
df["preprocess_text"]=df['Text'].apply(preprocess)

In [19]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test= train_test_split(
    df.preprocess_text,
    df.category_num,
    test_size=0.2,
    random_state=42,
    stratify=df.category_num
)

In [20]:
y_train.value_counts()

category_num
0    2248
1    2064
2    1744
Name: count, dtype: int64
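
The training counts above follow directly from `stratify` and `test_size=0.2`: each class keeps 80% of its rows. A quick arithmetic check using the class totals printed earlier (for these totals the products are whole numbers, so rounding does not change anything):

```python
# Class totals from df.Category.value_counts().
class_counts = {"Thyroid_Cancer": 2810, "Colon_Cancer": 2580, "Lung_Cancer": 2180}
test_size = 0.2

# Stratification keeps each class's share the same in both splits.
expected_train = {name: round(n * (1 - test_size)) for name, n in class_counts.items()}
expected_test = {name: class_counts[name] - expected_train[name] for name in class_counts}

print(expected_train)
print(expected_test)
```

The test-side numbers should match the `support` column of the classification reports.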

In [39]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF features feeding a multinomial Naive Bayes classifier.
clfNB = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer()),
    ('MultiNomialNB', MultinomialNB())
])
clfNB.fit(X_train, y_train)

In [40]:
y_pred= clfNB.predict(X_test)

In [41]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.93      0.91      0.92       562
           1       0.90      0.93      0.91       516
           2       1.00      1.00      1.00       436

    accuracy                           0.94      1514
   macro avg       0.94      0.94      0.94      1514
weighted avg       0.94      0.94      0.94      1514
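
For reference, the per-class columns of the report can be reproduced by hand. The sketch below computes precision, recall, and F1 for one label from plain Python lists; the function name `per_class_metrics` and the toy labels are illustrative only.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels, not taken from the notebook.
toy_true = [0, 0, 0, 1, 1, 2, 2, 2]
toy_pred = [0, 0, 1, 1, 1, 2, 2, 0]
for label in (0, 1, 2):
    print(label, per_class_metrics(toy_true, toy_pred, label))
```

The `macro avg` row is the unweighted mean of these per-class values, while `weighted avg` weights each class by its support.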



In [42]:
from sklearn.ensemble import RandomForestClassifier
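# Same TF-IDF features with a random forest. No random_state is set,
# so scores can differ slightly from run to run.
clfRF = Pipeline([
    ('vectorizer_tf', TfidfVectorizer()),
    ('randomforest', RandomForestClassifier())
])
clfRF.fit(X_train, y_train)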

In [43]:
y_pred=clfRF.predict(X_test)

In [44]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      1.00       562
           1       1.00      0.99      1.00       516
           2       1.00      1.00      1.00       436

    accuracy                           1.00      1514
   macro avg       1.00      1.00      1.00      1514
weighted avg       1.00      1.00      1.00      1514
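
Scores this close to 1.00 are worth a sanity check. One common cause is the same document appearing in both splits, which a random split cannot prevent if the dataset contains duplicates. Whether that applies here is not known from the outputs above; the sketch below only shows how it could be checked, using a hypothetical helper and toy data.

```python
import pandas as pd

def count_leaked_texts(train_texts: pd.Series, test_texts: pd.Series) -> int:
    """Number of test rows whose exact text also appears in the training set."""
    return int(test_texts.isin(train_texts.unique()).sum())

# Toy series standing in for X_train and X_test.
toy_train = pd.Series(["thyroid nodule", "colon polyp", "lung lesion"])
toy_test = pd.Series(["colon polyp", "pleural effusion"])
print(count_leaked_texts(toy_train, toy_test))
```

On the notebook's data the call would be `count_leaked_texts(X_train, X_test)`; a large count would suggest deduplicating with `df.drop_duplicates(subset='Text')` before splitting.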



In [45]:
import pickle
# Persist the Naive Bayes pipeline (vectorizer and classifier together).
with open("capstone3.pkl", "wb") as pickle_out:
    pickle.dump(clfNB, pickle_out)
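
For the API step, the pickled pipeline can be loaded and wrapped in a small prediction function that any web framework could call. The sketch below is self-contained, so it trains and pickles a tiny stand-in pipeline in a temporary directory instead of reading `capstone3.pkl`; the `predict_category` helper and the toy documents are assumptions. The label mapping is the one used for `category_num`.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Inverse of the category_num mapping used in the notebook.
LABELS = {0: "Thyroid_Cancer", 1: "Colon_Cancer", 2: "Lung_Cancer"}

def predict_category(model, text: str) -> str:
    """Return the category name for one (already preprocessed) text."""
    return LABELS[int(model.predict([text])[0])]

# Stand-in model trained on three toy documents, one per class.
toy_model = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
toy_model.fit(
    ["thyroid gland nodule", "colon polyp colonoscopy", "lung nodule smoking"],
    [0, 1, 2],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)
    # An API process would do only this part, once at startup.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

result = predict_category(loaded_model, "colon polyp")
print(result)
```

In a real service the incoming text would first go through the same `preprocess` function used for training, and pickle files should only be loaded from trusted sources.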