
# **Gradient Boosting Classifier**

**Objective:**
- Build a predictive model that classifies scientific journals into their SJR quartiles (Q1, Q2, Q3, Q4, NQ). Missing SJR values are filled in by regression imputation, and a Gradient Boosting Classifier performs the classification. The user specifies a field of study, and the model is trained and evaluated on the journals in that field.



- Import modules 

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder, LabelEncoder



- Load the data

In [2]:
file_path = "C://Users//Alpana//Desktop//clg files//fall sem 2024-25//machine learning//dataset.xlsx"
data = pd.read_excel(file_path)

# Data Pre-Processing 

In [3]:
# Display basic information about the dataset
print(f"Dataset shape: {data.shape}")



Dataset shape: (29165, 24)


In [4]:
# Find and display the count of missing values in each column
missing_values_count = data.isnull().sum()
print("Missing values count per column:")
print(missing_values_count)


Missing values count per column:
Rank                        0
Sourceid                    0
Title                       0
Type                        0
Issn                        0
SJR                       210
SJR Best Quartile           0
H index                     0
Total Docs. (2023)          0
Total Docs. (3years)        0
Total Refs.                 0
Total Cites (3years)        0
Citable Docs. (3years)      0
Cites / Doc. (2years)       0
Ref. / Doc.                 0
%Female                     0
Overton                     0
SDG                         0
Country                     0
Region                      0
Publisher                 407
Coverage                    0
Categories                  0
Areas                       0
dtype: int64


_____________________________________________________
- The dataset contains 29,165 journals, 210 of which have no SJR value. We impute these values instead of dropping the rows.
- Because SJR is related to the other bibliometric features, we fit a regression model on the rows where SJR is known and use it to predict SJR for the rows where it is missing.
________________________________________________________
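The imputation idea described above can be illustrated on a tiny example. This is a minimal sketch, not the notebook's actual pipeline: the `toy` frame and its column names `cites_per_doc` and `sjr` are hypothetical, and the data is constructed so that `sjr` is exactly half of `cites_per_doc`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: sjr = 0.5 * cites_per_doc, with one value missing
toy = pd.DataFrame({
    "cites_per_doc": [1.0, 2.0, 3.0, 4.0, 5.0],
    "sjr": [0.5, 1.0, 1.5, np.nan, 2.5],
})

# Split rows into those with a known target and those to be imputed
known = toy[toy["sjr"].notna()]
unknown = toy[toy["sjr"].isna()]

# Fit on the complete rows only
model = LinearRegression()
model.fit(known[["cites_per_doc"]], known["sjr"])

# Write the predictions back into the missing positions
toy.loc[toy["sjr"].isna(), "sjr"] = model.predict(unknown[["cites_per_doc"]])
print(toy)
```

Because the toy relationship is exactly linear, the missing entry is recovered as 2.0. On real data the imputed values are only estimates.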

## **Context**
- Scientific journals are categorized into quartiles (Q1, Q2, Q3, Q4, NQ) based on their impact and influence within a specific field of study. The SCImago Journal Rank (SJR) is a key bibliometric indicator that reflects the average number of weighted citations received by a journal's articles over a particular period. Some journals have no SJR value (210 in our dataset), which makes it harder to predict their quartile rankings accurately.

- This project aims to develop a machine learning model that predicts the quartile category of a journal from its field of study and the available bibliometric factors, even when SJR values are missing.

**Bibliometric indicators** are quantitative measures used to evaluate the impact, productivity, and quality of academic research publications and their influence within the scientific community. Academic and research institutions use them to assess the performance of researchers, journals, institutions, and research programs.

**Our bibliometric indicators**
 
 
SJR Best Quartile
- The best quartile ranking of a journal based on its SJR (SCImago Journal Rank) score.
- Q1, Q2, Q3, Q4, NQ (not quartile ranked).

SJR (SCImago Journal Rank)
- The average number of weighted citations received in a particular year by articles published in the journal during the three preceding years.
- Data Type: Real number.

H Index
- The number of articles (h) in a journal that have received at least h citations over the entire period.
- Data Type: Integer.

Total Docs. (2023)
- The total number of documents published by the journal in 2023.
- Data Type: Integer.

Total Docs. (3 years)
- The total number of documents published by the journal over the previous three years (2020-2022).
- Data Type: Integer.

Total Refs.
- The total number of references included in the journal's articles published in 2023.
- Data Type: Integer.

Total Cites (3 years)
- The total number of citations received in 2023 by articles published in the journal during the previous three years (2020-2022).
- Data Type: Integer.

Citable Docs. (3 years)
- The number of citable documents (articles, reviews, conference papers) published in the journal over the previous three years (2020-2022).
- Data Type: Integer.

Cites/Doc. (2 years)
- The average number of citations per document in a two-year period.
- Data Type: Real number.

Ref./Doc.
- The average number of references per document published in the journal in 2023.
- Data Type: Real number.
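
As a worked example of one indicator defined above, the H index can be computed from a list of per-article citation counts. This is a small sketch with made-up counts; the function name `h_index` is hypothetical and is not part of the dataset or of any library used in this notebook.

```python
def h_index(citations):
    """Return the largest h such that h articles have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, cites in enumerate(ranked, start=1):
        if cites >= position:
            h = position  # the top `position` articles all have >= position citations
        else:
            break
    return h

# Hypothetical citation counts for five articles
example_citations = [10, 8, 5, 4, 3]
print(h_index(example_citations))  # 4: four articles have at least 4 citations each
```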


_____________________________________________________
**FILLING MISSING VALUES USING REGRESSION MODEL**
___________________________________________________

In [5]:
# Display basic information about the dataset
print(f"Dataset shape: {data.shape}")



Dataset shape: (29165, 24)


In [6]:
data['SJR Best Quartile'].value_counts()

SJR Best Quartile
Q1    8702
Q2    7295
Q3    6674
Q4    6069
-      425
Name: count, dtype: int64

In [7]:
# Replace "-" with "NQ"
data['SJR Best Quartile'] = data['SJR Best Quartile'].replace("-", "NQ")

In [8]:
data['SJR Best Quartile'].value_counts()

SJR Best Quartile
Q1    8702
Q2    7295
Q3    6674
Q4    6069
NQ     425
Name: count, dtype: int64

In [9]:
# Identify the numeric columns and the identifier/text columns that are excluded from the models
# (names must match the dataset's column headers exactly)
numeric_columns = ['SJR', 'H index', 'Total Docs. (2023)', 'Total Docs. (3years)', 'Total Refs.', 'Total Cites (3years)', 'Citable Docs. (3years)', 'Cites / Doc. (2years)', 'Ref. / Doc.']
categorical_columns = ['Rank', 'Title', 'Issn', 'Coverage', 'Categories', 'Areas']


In [10]:
# Find and separate the rows with missing SJR
missing_sjr_data = data[data['SJR'].isna()]

In [11]:
# Define features and target for the rows with complete SJR values
complete_data = data.dropna(subset=['SJR'])
X_train = complete_data.drop(columns=['SJR', 'SJR Best Quartile'] + categorical_columns)
y_train = complete_data['SJR']

In [12]:
# Define features for the rows with missing SJR
X_test = missing_sjr_data.drop(columns=['SJR', 'SJR Best Quartile'] + categorical_columns)


In [13]:
# One-hot encode categorical variables
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)

In [14]:
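# Redefine features and target for the rows with complete SJR values.
# These definitions overwrite the ones from In [11]; here 'SJR Best Quartile' is kept as a predictor.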
complete_data = data.dropna(subset=['SJR'])
X_train = complete_data.drop(columns=['SJR', 'Rank', 'Title', 'Issn', 'Coverage', 'Categories', 'Areas'])
y_train = complete_data['SJR']

In [15]:
# Define features for the rows with missing SJR
X_test = missing_sjr_data.drop(columns=['SJR', 'Rank', 'Title', 'Issn', 'Coverage', 'Categories', 'Areas'])

In [16]:
# One-hot encode categorical variables
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)

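- Because one-hot encoding is applied to the training and test sets separately, the two sets can end up with different feature columns (for example, when a publisher appears in only one of them).
- The test columns must therefore be aligned with the training columns before prediction.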
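
The column mismatch described above can be reproduced on a small example. This is a sketch with hypothetical `Region` values; it omits `drop_first=True` to keep the output easy to read.

```python
import pandas as pd

# Hypothetical frames: the two sets do not contain the same categories
train = pd.DataFrame({"Region": ["Asia", "Europe", "Africa"]})
test = pd.DataFrame({"Region": ["Europe", "Oceania"]})

train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)

print(list(train_encoded.columns))  # ['Region_Africa', 'Region_Asia', 'Region_Europe']
print(list(test_encoded.columns))   # ['Region_Europe', 'Region_Oceania']

# Keep exactly the training columns: missing ones are filled with 0,
# and columns unseen during training (Region_Oceania) are dropped
aligned_test = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(list(aligned_test.columns))
```

After `reindex`, a model fitted on `train_encoded` receives the columns it expects, in the same order.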

In [17]:
# Align columns in the test set to match the training set
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

In [18]:
# Train the regression model
regressor = LinearRegression()


In [19]:
regressor.fit(X_train, y_train)

In [20]:
# Check the columns in both X_train and X_test
print("Training columns:")
print(X_train.columns)
print("Test columns:")
print(X_test.columns)

# Align columns in X_test to match X_train
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)


Training columns:
Index(['Sourceid', 'H index', 'Total Docs. (2023)', 'Total Docs. (3years)',
       'Total Refs.', 'Total Cites (3years)', 'Citable Docs. (3years)',
       'Cites / Doc. (2years)', 'Ref. / Doc.', '%Female',
       ...
       'Publisher_shi you yu tian ran qi di zhi bian ji bu',
       'Publisher_shu xue xue bao bian ji bu',
       'Publisher_slamic Republic of Iran Medical Council',
       'Publisher_suburban eV', 'Publisher_universitat de Trieste',
       'Publisher_universitatea de vest',
       'Publisher_xian dai sui dao ji shu bian ji bu',
       'Publisher_ying yong guang xue bian ji bu',
       'Publisher_zhong guo shi pin xue bao bian ji bu',
       'Publisher_zhong guo xue xi chong bing fang zhi za zhi she'],
      dtype='object', length=8249)
Test columns:
Index(['Sourceid', 'H index', 'Total Docs. (2023)', 'Total Docs. (3years)',
       'Total Refs.', 'Total Cites (3years)', 'Citable Docs. (3years)',
       'Cites / Doc. (2years)', 'Ref. / Doc.', '%Female',


In [21]:
# Predict missing SJR values
predicted_sjr = regressor.predict(X_test)
data.loc[data['SJR'].isna(), 'SJR'] = predicted_sjr

In [22]:
# Find and display the count of missing values in each column
missing_values_count = data.isnull().sum()
print("Missing values count per column:")
print(missing_values_count)

Missing values count per column:
Rank                        0
Sourceid                    0
Title                       0
Type                        0
Issn                        0
SJR                         0
SJR Best Quartile           0
H index                     0
Total Docs. (2023)          0
Total Docs. (3years)        0
Total Refs.                 0
Total Cites (3years)        0
Citable Docs. (3years)      0
Cites / Doc. (2years)       0
Ref. / Doc.                 0
%Female                     0
Overton                     0
SDG                         0
Country                     0
Region                      0
Publisher                 407
Coverage                    0
Categories                  0
Areas                       0
dtype: int64


# GRADIENT BOOSTING

In [23]:
# Function to filter data based on user input and train the classifier
def train_and_evaluate_model(field_of_study):
    # Match the field name literally (regex=False) so that special characters are not read as a pattern
    filtered_data = data[data['Areas'].str.contains(field_of_study, case=False, na=False, regex=False)].copy()
    
    if filtered_data.empty:
        print(f"No data available for the specified field of study: {field_of_study}")
        return None
    
    # Encode target labels as integers (LabelEncoder sorts the class names alphabetically)
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(filtered_data['SJR Best Quartile'])
    
    # Define features for classification: drop identifier/text columns and the target
    X = filtered_data.drop(columns=['Rank', 'Title', 'Issn', 'Coverage', 'Categories', 'Areas', 'SJR Best Quartile'])
    
    # One-hot encode categorical variables
    X = pd.get_dummies(X, drop_first=True)
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Train the Gradient Boosting Classifier
    classifier = GradientBoostingClassifier()
    classifier.fit(X_train, y_train)
    
    # Predict and evaluate the model
    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Classification report for field of study: {field_of_study}")
    # Pass labels explicitly so the report still works if a class is absent from the test split
    all_labels = list(range(len(label_encoder.classes_)))
    print(classification_report(y_test, y_pred, labels=all_labels, target_names=label_encoder.classes_))
    print(f"Accuracy: {accuracy * 100:.2f}%")
    
    return accuracy
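
The label encoding step in the function above determines the row order of the classification report. The following sketch uses a hypothetical list of quartile labels to show how `LabelEncoder` maps class names to integers.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of quartile labels, in arbitrary order
quartiles = ["Q2", "NQ", "Q1", "Q4", "Q3", "Q1"]

encoder = LabelEncoder()
codes = encoder.fit_transform(quartiles)

print(list(encoder.classes_))  # ['NQ', 'Q1', 'Q2', 'Q3', 'Q4'] (sorted alphabetically)
print(list(codes))             # [2, 0, 1, 4, 3, 1]

# inverse_transform turns predicted integers back into quartile names
print(list(encoder.inverse_transform([0, 4])))  # ['NQ', 'Q4']
```

Since the classes are sorted, `NQ` is encoded as 0 and appears first in the report, ahead of `Q1` to `Q4`.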

In [24]:
# Collect the individual areas from the semicolon-separated 'Areas' column
def get_unique_areas(data):
    # dropna() guards against missing entries, which cannot be split
    areas = data['Areas'].dropna().str.split(';')
    unique_areas = set()
    for area_list in areas:
        unique_areas.update(area.strip() for area in area_list)
    return unique_areas
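
The split-and-strip logic above can be tried on a few hypothetical `Areas` strings. The sketch also contrasts substring matching with exact matching on the split parts: a substring search for "Engineering" also selects journals listed under "Chemical Engineering". The variable names below are illustrative only.

```python
import pandas as pd

# Hypothetical 'Areas' values in the dataset's semicolon-separated format
areas = pd.Series([
    "Chemical Engineering; Chemistry",
    "Engineering",
    "Medicine; Nursing",
])

# Split on ';' and strip whitespace to collect the individual areas
unique_areas = set()
for area_list in areas.str.split(";"):
    unique_areas.update(area.strip() for area in area_list)
print(sorted(unique_areas))

# Substring match: also hits "Chemical Engineering"
substring_hits = areas[areas.str.contains("Engineering", case=False, regex=False)]

# Exact match on the individual areas: hits only the row listing "Engineering" itself
exact_hits = areas[areas.str.split(";").apply(
    lambda parts: "engineering" in [p.strip().lower() for p in parts]
)]

print(len(substring_hits), len(exact_hits))  # 2 1
```

Which behaviour is preferable depends on whether related fields should be pooled into the same training set.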


In [25]:
# Show available fields of study   
unique_fields_of_study = get_unique_areas(data)
print("Available fields of study:")
for field in unique_fields_of_study:
    print(field)



Available fields of study:
Multidisciplinary
Materials Science
Neuroscience
Nursing
Mathematics
Pharmacology, Toxicology and Pharmaceutics
Psychology
Decision Sciences
Computer Science
Environmental Science
Veterinary
Social Sciences
Health Professions
Arts and Humanities
Chemistry
Biochemistry, Genetics and Molecular Biology
Physics and Astronomy
Economics, Econometrics and Finance
Earth and Planetary Sciences
Business, Management and Accounting
Energy
Chemical Engineering
Dentistry
Medicine
Immunology and Microbiology
Agricultural and Biological Sciences
Engineering


In [26]:
# Example usage
field_of_study = input("Enter the field of study: ").strip()
total_accuracy = train_and_evaluate_model(field_of_study)
# The function returns None when no journal matches the requested field
if total_accuracy is not None:
    print(f"Total accuracy of the model for the specified field of study '{field_of_study}' is: {total_accuracy * 100:.2f}%")

Classification report for field of study: Chemical Engineering
              precision    recall  f1-score   support

          NQ       1.00      0.50      0.67         2
          Q1       0.88      0.98      0.93        47
          Q2       0.85      0.83      0.84        42
          Q3       0.96      0.84      0.90        31
          Q4       0.95      1.00      0.98        20

    accuracy                           0.90       142
   macro avg       0.93      0.83      0.86       142
weighted avg       0.90      0.90      0.90       142

Accuracy: 90.14%
Total accuracy of the model for the specified field of study 'Chemical Engineering' is: 90.14%
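
The `macro avg` and `weighted avg` rows of the report can be checked by hand from the per-class rows. The sketch below copies the rounded precision, recall, and support values from the Chemical Engineering report; because the inputs are already rounded to two decimals, agreement with the printed averages is approximate in general, although it matches here.

```python
# Per-class values copied from the Chemical Engineering report above
support = {"NQ": 2, "Q1": 47, "Q2": 42, "Q3": 31, "Q4": 20}
precision = {"NQ": 1.00, "Q1": 0.88, "Q2": 0.85, "Q3": 0.96, "Q4": 0.95}
recall = {"NQ": 0.50, "Q1": 0.98, "Q2": 0.83, "Q3": 0.84, "Q4": 1.00}

total = sum(support.values())

def macro_average(metric):
    # Every class counts equally, regardless of how many samples it has
    return sum(metric.values()) / len(metric)

def weighted_average(metric):
    # Each class is weighted by its support (number of test samples)
    return sum(metric[c] * support[c] for c in metric) / total

print(total)                                                   # 142
print(round(macro_average(precision), 2), round(macro_average(recall), 2))        # 0.93 0.83
print(round(weighted_average(precision), 2), round(weighted_average(recall), 2))  # 0.9 0.9
```

The macro recall (0.83) is well below the weighted recall (0.90) because the `NQ` class, with only 2 test samples and a recall of 0.50, counts as much as any other class in the macro average.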
