# Meta Analysis Classification Task
As a meta-analysis classification task, select at least three topics such as "kidney disease", 
"type 2 diabetics", "brain injury", etc. 

Here are the steps:
- Preparing the dataset for the classification task
- Selecting at least 4 classification techniques
- Applying regularization and experimenting with the parameters of the techniques to improve the initial 
accuracy scores.

In [1]:
# required installs
!pip install Bio # for accessing the PubMed database

Collecting Bio
  Downloading bio-1.6.0-py3-none-any.whl (279 kB)
Collecting biopython>=1.80 (from Bio)
  Downloading biopython-1.81-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Collecting mygene (from Bio)
  Downloading mygene-3.2.2-py2.py3-none-any.whl (5.4 kB)
Collecting gprofiler-official (from Bio)
  Downloading gprofiler_official-1.0.0-py3-none-any.whl (9.3 kB)
Collecting biothings-client>=0.2.6 (from mygene->Bio)
  Downloading biothings_client-0.3.1-py2.py3-none-any.whl (29 kB)
Installing collected packages: biopython, gprofiler-official, biothings-client, mygene, Bio
Successfully installed Bio-1.6.0 biopython-1.81 biothings-client-0.3.1 gprofiler-official-1.0.0 mygene-3.2.2


In [93]:
# necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re

# Data Gathering

We use the National Library of Medicine's PubMed database to gather the article metadata for the analysis.

To do that, we use the Entrez Programming Utilities (E-utilities) API.

References:
* https://www.ncbi.nlm.nih.gov/books/NBK25501/
* https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
* https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
* https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
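
The search step performed through E-utilities can be pictured as a plain HTTP request. The sketch below only builds an ESearch URL with the standard library and does not send it; the helper name `build_esearch_url` is hypothetical, and the parameters mirror the ones passed to `Entrez.esearch` in the data-gathering cell.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Public base URL of the E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(keyword, retmax=1000, database="pubmed"):
    """Build (but do not send) an ESearch request URL. Hypothetical helper."""
    params = {
        "db": database,
        # Restrict the phrase match to the title and abstract fields.
        "term": f'("{keyword}"[Title/Abstract])',
        "retmax": retmax,
        "sort": "relevance",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url("kidney disease", retmax=5)
print(url)

# Decode the query string again to inspect the search term that would be sent.
query = parse_qs(urlparse(url).query)
print(query["term"][0])
```

Biopython's `Entrez.esearch` wraps this kind of request and additionally attaches the e-mail address set in `Entrez.email`.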

In [64]:
############################
# Data Gathering
############################

from Bio import Entrez
from Bio import Medline

database = "pubmed"

# Set values for the function inputs:
queries = ["type 2 diabetics", "brain injury", "kidney disease"]
retmax = 1000
df_list = []


# set your e-mail address here; NCBI asks for one with E-utilities requests
Entrez.email = ''

def filterPubmedArticles(database, keyword, retmax):
    handle = Entrez.esearch(db=database, term=f'("{keyword}"[Title/Abstract])', retmax=retmax, sort='Relevance')
    record = Entrez.read(handle)
    handle.close()

    id_list = record['IdList']
    if not id_list:
        print("No articles found - consider broadening query terms.")
        return
    else:
        handle = Entrez.efetch(db=database, id=id_list, rettype='medline', retmode='text')
        records = Medline.parse(handle)
        articles = []

        # loop through records; grab desired fields (here just title & abstract)
        for record in records:
            title = record.get('TI', '')
            abstract = record.get('AB', '')
            articles.append({'Topic': keyword,
                             'Title': title,
                             'Abstract': abstract,
                             'Text': title + ' ' + abstract,
                             })

        # all records have been read, so the fetch handle can be closed
        handle.close()

        # now use pandas to write the resulting records to a DataFrame.
        df = pd.DataFrame(articles)
        df_list.append(df)


for query in queries:
  filterPubmedArticles(database, query, retmax)

# Concatenate the list of DataFrames into a single DataFrame
combined_df = pd.concat(df_list, ignore_index=True)

# save as csv format
combined_df.to_csv('articles.csv', index=False)
print(f"{combined_df.shape[0]} articles found. Output saved to 'articles.csv'.")

3000 articles found. Output saved to 'articles.csv'.


# Data Analysis

In [65]:
#############################
# Data Analysis
#############################

df_ = pd.read_csv("articles.csv")
df = df_.copy()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Topic     3000 non-null   object
 1   Title     2997 non-null   object
 2   Abstract  2502 non-null   object
 3   Text      3000 non-null   object
dtypes: object(4)
memory usage: 93.9+ KB


In [66]:
def check_values(dataframe, col):
  # share of distinct values among the non-null entries, in percent
  print(f"Unique occurrences ratio: {(dataframe[col].nunique() / dataframe[col].count()) * 100}")
  # share of non-null entries among all rows, in percent
  print(f"Non-null ratio: {(dataframe[col].count() / dataframe.shape[0]) * 100}")

for col in df.columns:
  print(f"#############{col}############")
  check_values(df, col)

#############Topic############
Unique occurrences ratio: 0.1
Non-null ratio: 100.0
#############Title############
Unique occurrences ratio: 99.66633299966632
Non-null ratio: 99.9
#############Abstract############
Unique occurrences ratio: 99.52038369304557
Non-null ratio: 83.39999999999999
#############Text############
Unique occurrences ratio: 99.8
Non-null ratio: 100.0


In [67]:
df.describe(include='all').T

| | count | unique | top | freq |
|---|---|---|---|---|
| Topic | 3000 | 3 | type 2 diabetics | 1000 |
| Title | 2997 | 2987 | Burden, Access, and Disparities in Kidney Dise... | 4 |
| Abstract | 2502 | 2490 | World Kidney Day 2016 focuses on kidney diseas... | 4 |
| Text | 3000 | 2994 | Burden, Access, and Disparities in Kidney Dise... | 4 |


In [68]:
df.head()

| | Topic | Title | Abstract | Text |
|---|---|---|---|---|
| 0 | type 2 diabetics | Intestinal parasitic infections among type-2 d... | Intestinal parasitic infections (IPIs) are com... | Intestinal parasitic infections among type-2 d... |
| 1 | type 2 diabetics | Creating the amylin story. | This paper is based on a presentation given at... | Creating the amylin story. This paper is based... |
| 2 | type 2 diabetics | Delay of insulin therapy in type 2 diabetics. | Type 2 diabetes (T2DM) is a progressive diseas... | Delay of insulin therapy in type 2 diabetics. ... |
| 3 | type 2 diabetics | Cardiovascular risk associated with co-morbid ... | OBJECTIVE: In the general population, co-morbi... | Cardiovascular risk associated with co-morbid ... |
| 4 | type 2 diabetics | Adipocytokines and anthropometric measures in ... | AIM: Type 2 diabetes mellitus has assumed pand... | Adipocytokines and anthropometric measures in ... |


In [74]:
df.Text[0]

"Intestinal parasitic infections among type-2 diabetics seeking healthcare in Sana'a city, Yemen. Intestinal parasitic infections (IPIs) are common in developing countries, particularly in countries witnessing conflicts and humanitarian crises like Yemen. Type-2 diabetics are among the population categories most vulnerable to a variety of infections, including IPIs. Therefore, this study determined the prevalence and risk factors associated with IPIs among Yemeni type-2 diabetics in Sana'a city. This hospital-based, cross-sectional study recruited 389 type-2 diabetics seeking healthcare in Sana'a from December 2019 to February 2020. Sociodemographic data and risk factors were collected from interviewed participants using a structured questionnaire. Stool samples were collected and examined for parasites using standard techniques. The association of sociodemographic characteristics and risk factors with IPIs was tested using univariate analysis, and a multivariable logistic regression m

### Missing Values

In [82]:
df.isnull().sum()

Topic                0
Title                3
Abstract           498
Text                 0
Text_Word_Count      0
dtype: int64

In [83]:
# drop the nan values
df.dropna(inplace=True)

In [84]:
def word_count(text):
  return len(text.split(" "))

df["Text_Word_Count"] = df["Text"].apply(word_count)
df["Text_Word_Count"].head()

0    295
1    200
2    276
3    264
4    251
Name: Text_Word_Count, dtype: int64

# Feature Engineering

In [86]:
def sentence_regex(text):
    # Remove all the special characters
    text = re.sub(r'\W', ' ', text)

    # remove all single characters
    text= re.sub(r'\s+[a-zA-Z]\s+', ' ', text)

    # Remove a single character at the start of the text
    # (the caret must be unescaped to anchor the match at the beginning)
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)

    # Substituting multiple spaces with single space
    text = re.sub(r'\s+', ' ', text)

    # Removing prefixed 'b'
    text = re.sub(r'^b\s+', '', text)

    # Removing numbers
    text = re.sub(r'\d+', '', text)

    # Converting to Lowercase
    text = text.lower()

    return text

df['Text'] = df['Text'].apply(sentence_regex)

In [87]:
df.Text.head()

0    intestinal parasitic infections among type  di...
1    creating the amylin story this paper is based ...
2    delay of insulin therapy in type  diabetics ty...
3    cardiovascular risk associated with co morbid ...
4    adipocytokines and anthropometric measures in ...
Name: Text, dtype: object

In [89]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(df["Topic"])
y

array([2, 2, 2, ..., 1, 1, 1])

We used LabelEncoder to convert the categorical values to integers for the model. The Topic column has 3 categories, and they are not ordinal. LabelEncoder therefore assigns each category a consecutive integer code, but the order of those codes carries no meaning.
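
As a small illustration of how the codes are assigned, the sketch below fits a `LabelEncoder` on a short made-up list of the three topic names (not the notebook's data). The labels are sorted, and each label's code is its position in `classes_`.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of topic labels, for illustration only.
topics = ["type 2 diabetics", "brain injury", "kidney disease", "brain injury"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(topics)

# classes_ holds the labels in sorted order; a label's code is its index there.
print(list(le_demo.classes_))  # ['brain injury', 'kidney disease', 'type 2 diabetics']
print(codes.tolist())          # [2, 0, 1, 0]

# inverse_transform maps integer codes back to the original labels.
print(list(le_demo.inverse_transform([2, 1])))  # ['type 2 diabetics', 'kidney disease']
```

This matches the encoded array above, where the "type 2 diabetics" rows are coded 2 and the "kidney disease" rows are coded 1.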

## Word Analysis

In [90]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import nltk

nltk.download('stopwords')
vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'))
X = vectorizer.fit_transform(df["Text"]).toarray()

feature_names = vectorizer.get_feature_names_out()
X = pd.DataFrame(X, columns=feature_names)

X

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,aa,aaa,aac,aace,aai,aarhus,aasi,aat,aav,ab,...,zoledronic,zolendronic,zolpidem,zona,zone,zopiclone,zoster,zucker,zung,zurich
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2496,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2497,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2498,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2499,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [117]:
from sklearn.decomposition import TruncatedSVD

# Dimensionality reduction
svd = TruncatedSVD(n_components=2, random_state=42)
X_svd = svd.fit_transform(X)

print(f"Total variance explained: {np.sum(svd.explained_variance_ratio_):.2f}")
# The variance explained is quite low for real applications.

Total variance explained: 0.05


If we set n_components to 1, the total explained variance drops to 0.01, and the model accuracy decreases too much.
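
To see how the explained variance depends on the number of components, the cumulative sum of `explained_variance_ratio_` can be inspected. The sketch below is a minimal example on a synthetic sparse matrix standing in for a TF-IDF matrix; the matrix size and density are assumptions chosen only to keep it fast, so the printed numbers say nothing about the PubMed data.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in for a sparse TF-IDF matrix: 200 documents, 500 terms.
X_demo = sparse_random(200, 500, density=0.02, random_state=42, format="csr")

# Fit once with the largest number of components of interest.
svd_demo = TruncatedSVD(n_components=50, random_state=42)
svd_demo.fit(X_demo)

# Running total of the variance share captured by the first k components.
cumulative = np.cumsum(svd_demo.explained_variance_ratio_)
for k in (1, 2, 10, 50):
    print(f"{k:>2} components: {cumulative[k - 1]:.3f} of the variance")
```

Plotting `cumulative` against the component count is a common way to pick `n_components`, trading explained variance against dimensionality.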

# Model

Training the models takes three steps:
- Configuring the base models.
- Hyperparameter optimizations.
- Stacking and Ensemble Learning
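
For the first step, one option is to wrap vectorization, dimensionality reduction, and the classifier in a single scikit-learn `Pipeline`, so that TF-IDF and SVD are re-fitted on the training part of every cross-validation fold. The sketch below is an assumption about how that could look, not the setup used in this notebook, and it runs on a tiny made-up corpus.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, four short texts per topic, for illustration only.
texts = [
    "insulin therapy and glucose control in diabetics",
    "glucose levels and insulin resistance",
    "diabetics and dietary glucose management",
    "insulin dosing for glucose control",
    "traumatic brain injury and cognitive recovery",
    "brain injury outcomes after head trauma",
    "head trauma and brain rehabilitation",
    "cognitive deficits following brain injury",
    "chronic kidney disease and dialysis",
    "renal function decline in kidney disease",
    "dialysis outcomes and renal failure",
    "kidney transplant and renal function",
]
labels = [0] * 4 + [1] * 4 + [2] * 4

# Each fold fits the vectorizer and the SVD on its own training split only.
pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),
    LogisticRegression(),
)

scores = cross_validate(pipe, texts, labels, cv=2, scoring="accuracy")["test_score"]
print(scores.mean())
```

Because the preprocessing lives inside the estimator, the same `pipe` object can be handed to `GridSearchCV` or to an ensemble without further changes.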

In [118]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

lr = LogisticRegression()
result = cross_validate(lr, X_svd, y, cv=5)

result['test_score'].mean()

0.9784047904191617

## Base Model

In [120]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate
import warnings
warnings.filterwarnings("ignore")

def base_models(X, y, scoring="roc_auc"):
    print("Base Models....")
    classifiers = [
                    ('LR', LogisticRegression()),
                    ('RF', RandomForestClassifier()),
                  ]

    for name, classifier in classifiers:
        cv_results = cross_validate(classifier, X, y, cv=5, scoring=scoring)
        print(f"{scoring}: {round(cv_results['test_score'].mean(), 4)} ({name}) ")

base_models(X_svd, y, scoring="accuracy")

Base Models....
accuracy: 0.9784 (LR) 
accuracy: 0.9928 (RF) 


## Hyperparameter optimization

To optimize the hyperparameters we used GridSearchCV (grid search with cross-validation), a technique for finding the best-performing hyperparameter combination for a model.

GridSearchCV performs an exhaustive search over a specified parameter grid: you provide a set of candidate values for each hyperparameter, and it evaluates the model for every combination of those values.

So that the results do not depend on one particular train-test split, GridSearchCV uses cross-validation. It divides the dataset into several folds and, for each hyperparameter combination, repeatedly trains the model on all folds but one and evaluates it on the held-out fold, then averages the scores.

References:
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
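
The search described above can be sketched on synthetic data. The dataset from `make_classification`, the single-parameter grid, and the variable names are all assumptions made for a fast, self-contained example; they are not the grid used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class problem standing in for the real features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Three candidate values for C, so three combinations are evaluated.
param_grid = {"C": [0.1, 1, 10]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=3,                # 3-fold cross-validation for every candidate
    scoring="accuracy",  # metric used to rank the candidates
)
search.fit(X_toy, y_toy)

print(search.best_params_)                         # best combination found
print(round(search.best_score_, 4))                # its mean cross-validated accuracy
print(search.cv_results_["mean_test_score"])       # one mean score per candidate
```

With the default `refit=True`, `search.best_estimator_` is already refitted on the full data with the best parameters, so it can be used directly.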

In [121]:
from sklearn.model_selection import GridSearchCV

# Parameters for models
knn_params = {"n_neighbors": range(2, 50)}

rf_params = {"max_depth": [3, 5],
             "max_features": [5, 2500, "auto"],
             "min_samples_split": [15, 20],
             "n_estimators": [200, 300]}

svc_params = {"C": [0.1, 1, 10],  # inverse regularization strength: smaller C means stronger regularization
              "kernel": ['rbf', 'poly']}
              # gamma (not tuned here) controls the width of the RBF kernel

ridge_params = {"alpha": [0.1, 1, 10]}

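# note: 'l1' needs a solver that supports it (e.g. 'liblinear' or 'saga');
# with the default 'lbfgs' solver those candidates fail to fit and get a NaN score
lr_params = {"penalty": ['l1', 'l2'],
             "C": [0.1, 1, 10]}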

classifiers = [('LR', LogisticRegression(), lr_params),
               #('SVC', SVC(), svc_params),
               #('Ridge', RidgeClassifier(), ridge_params),
               #('KNN', KNeighborsClassifier(), knn_params),
               ("RF", RandomForestClassifier(), rf_params)]

def hyperparameter_optimization(X, y, cv=3, scoring="roc_auc"):
    print("Hyperparameter Optimization....")
    best_models = {}
    for name, classifier, params in classifiers:
        print(f"########## {name} ##########")
        cv_results = cross_validate(classifier, X, y, cv=cv, scoring=scoring)
        print(f"{scoring} (Before): {round(cv_results['test_score'].mean(), 4)}")

        # use the same metric for the search as for the before/after comparison
        gs_best = GridSearchCV(classifier, params, cv=cv, scoring=scoring, n_jobs=-1, verbose=False).fit(X, y)
        final_model = classifier.set_params(**gs_best.best_params_)

        cv_results = cross_validate(final_model, X, y, cv=cv, scoring=scoring)
        print(f"{scoring} (After): {round(cv_results['test_score'].mean(), 4)}")
        print(f"{name} best params: {gs_best.best_params_}", end="\n\n")
        best_models[name] = final_model
    return best_models

best_models = hyperparameter_optimization(X_svd, y, scoring='accuracy')

Hyperparameter Optimization....
########## LR ##########
accuracy (Before): 0.976
accuracy (After): 0.9872
LR best params: {'C': 10, 'penalty': 'l2'}

########## RF ##########
accuracy (Before): 0.994
accuracy (After): 0.9944
RF best params: {'max_depth': 5, 'max_features': 'auto', 'min_samples_split': 15, 'n_estimators': 200}



## Stacking and Ensemble Learning

A voting classifier is an ensemble method that combines the predictions of several individual models into a final prediction. For classification, the final class is chosen either by majority vote over the predicted labels (hard voting) or by averaging the predicted class probabilities and taking the most probable class (soft voting, used here); the regression counterpart averages the predicted values.
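
The arithmetic behind soft voting is small enough to write out by hand. The probabilities below are made up for two hypothetical models, three samples, and three classes; the sketch only illustrates the averaging step, not scikit-learn's implementation.

```python
import numpy as np

# Made-up class probabilities from two models (rows: samples, columns: classes).
proba_lr = np.array([[0.60, 0.30, 0.10],
                     [0.20, 0.50, 0.30],
                     [0.40, 0.35, 0.25]])
proba_rf = np.array([[0.50, 0.40, 0.10],
                     [0.10, 0.30, 0.60],
                     [0.10, 0.60, 0.30]])

# Soft voting: average the probabilities, then pick the most probable class.
soft_proba = (proba_lr + proba_rf) / 2
soft_pred = soft_proba.argmax(axis=1)

# Each model's own label prediction, which is what hard voting would count.
pred_lr = proba_lr.argmax(axis=1)
pred_rf = proba_rf.argmax(axis=1)

print(pred_lr.tolist())    # [0, 1, 0]
print(pred_rf.tolist())    # [0, 2, 1]
print(soft_pred.tolist())  # [0, 2, 1]
```

For the second and third samples the two models disagree, and the averaged probabilities side with the more confident model.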

In [123]:
from sklearn.ensemble import VotingClassifier

def voting_classifier(best_models, X, y):
    print("Voting Classifier...")
    voting_clf = VotingClassifier(estimators=[('LR', best_models["LR"]), ('RF', best_models["RF"])],
                                  voting='soft').fit(X, y)
    cv_results = cross_validate(voting_clf, X, y, cv=5, scoring=["accuracy"])
    print(f"Accuracy: {cv_results['test_accuracy'].mean()}")
    return voting_clf

voting_classifier(best_models, X, y)

Voting Classifier...
Accuracy: 0.9984015968063872
