In [16]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
from collections import Counter
from wordcloud import WordCloud, STOPWORDS



In [None]:
# --- 1. Load and Prepare the Data ---
#load the dataset
try:
    df = pd.read_csv("metadata.csv")
    print("--- File Loaded Successfully ---")

    #display first few rows
    print(df.head())

    #summary of the dataset (info() prints directly and returns None)
    df.info()

    #rows and columns
    print(df.shape)

    #identify data types
    print(df.dtypes)

    #statistical summary (must be printed explicitly inside a try block)
    print(df.describe())

    #missing values
    print(df.isnull().sum())

except FileNotFoundError:
    print("File not found. Please check the file path and try again.")
    # Re-raise so the cell stops here instead of failing later with a NameError
    raise
    
#HANDLE MISSING DATA
# Define columns to remove
columns_to_drop = [
    'WHO #Covidence',
    'Microsoft Academic Paper ID',
    'sha',
    'full_text_file',
    'url'
]

# Drop the specified columns from the DataFrame
df_cleaned = df.drop(columns=columns_to_drop, errors='ignore')

print(f"\n--- Dropped columns: {', '.join(columns_to_drop)} ---")


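#fill missing values
# Assign the result back instead of calling fillna(..., inplace=True) on a
# column selection, which raises a FutureWarning in recent pandas versions.
# Fill missing 'abstract' with a placeholder
df_cleaned['abstract'] = df_cleaned['abstract'].fillna('No abstract available')

# Fill 'authors', 'journal', and 'license' with 'Unknown'
df_cleaned['authors'] = df_cleaned['authors'].fillna('Unknown')
df_cleaned['journal'] = df_cleaned['journal'].fillna('Unknown')
df_cleaned['license'] = df_cleaned['license'].fillna('Unknown')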

print("--- Filled missing values in abstract, authors, journal, and license ---")

#drop rows with missing title
# Get the row count before dropping
rows_before = len(df_cleaned)

# Drop rows where the 'title' is missing
df_cleaned.dropna(subset=['title'], inplace=True)

rows_after = len(df_cleaned)
print(f"--- Removed {rows_before - rows_after} rows with a missing title ---")

# Verify that there are no more missing values in the key columns
print("\nMissing values percentage after cleaning:")
print(df_cleaned.isnull().sum() / len(df_cleaned) * 100)

# Save the cleaned DataFrame to a new CSV file
cleaned_file_path = 'cleaned_metadata.csv'
df_cleaned.to_csv(cleaned_file_path, index=False)

print("\n--- Cleaning complete! ---")
print(f"Cleaned dataset saved to: {cleaned_file_path}")

--- File Loaded Successfully ---
   cord_uid                                       sha source_x  \
0  vho70jcx  f056da9c64fbf00a4645ae326e8a4339d015d155  biorxiv   
1  i9tbix2v  daf32e013d325a6feb80e83d15aabc64a48fae33  biorxiv   
2  62gfisc6  f33c6d94b0efaa198f8f3f20e644625fa3fe10d2  biorxiv   
3  058r9486  4da8a87e614373d56070ed272487451266dce919  biorxiv   
4  wich35l7  eccef80cfbe078235df22398f195d5db462d8000  biorxiv   

                                               title             doi pmcid  \
0  SIANN: Strain Identification by Alignment to N...  10.1101/001727   NaN   
1  Spatial epidemiology of networked metapopulati...  10.1101/003889   NaN   
2  Sequencing of the human IG light chain loci fr...  10.1101/006866   NaN   
3  Bayesian mixture analysis for metagenomic comm...  10.1101/007476   NaN   
4  Mapping a viral phylogeny onto outbreak trees ...  10.1101/010389   NaN   

   pubmed_id  license                                           abstract  \
0        NaN  biorxiv  Ne


cord_uid          0.000000
source_x          0.000000
title             0.000000
doi               7.297718
pmcid            42.481969
pubmed_id        24.061205
license           0.000000
abstract          0.000000
publish_time      0.019729
authors           0.000000
journal           0.000000
has_full_text     0.000000
dtype: float64

--- Cleaning complete! ---
Cleaned dataset saved to: cleaned_metadata.csv


# CORD-19 Research Paper Analysis 🔬

## 1. Data Loading and Preparation

### Objective
The first step is to load the `metadata.csv` dataset and perform initial cleaning. The goal of this phase is to prepare the data for exploratory analysis by dropping unneeded columns and handling missing values.

### Actions Taken
The following cleaning and preparation steps were performed:

* **Load Data**: The dataset was loaded into a pandas DataFrame.
* **Drop Unnecessary Columns**: Columns with a high percentage of missing values (`WHO #Covidence`, `Microsoft Academic Paper ID`) or those not needed for this analysis (`sha`, `full_text_file`, `url`) were removed.
* **Handle Missing Values**:
    * Missing `authors`, `journal`, and `license` values were filled with the placeholder "Unknown".
    * Missing `abstract` text was filled with "No abstract available".
* **Remove Invalid Rows**: Rows with a missing `title` were dropped to ensure data quality.
* **Save Output**: The cleaned DataFrame was written to `cleaned_metadata.csv`.
* **Dates Deferred**: The `publish_time` column is left unchanged at this stage (about 0.02% of its values are still missing). Converting it to datetime, creating `publish_year`, and dropping rows with invalid dates happen in Section 2.

### Outcome
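After these steps, the dataset is cleaned and contains the necessary columns for analysis. The identifier columns `doi`, `pmcid`, and `pubmed_id` still contain missing values (about 7%, 42%, and 24% of rows), but they are not used in the analyses that follow. We can now proceed to the Exploratory Data Analysis (EDA) phase.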
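
The fill-and-drop steps described above can be sketched on a small stand-in DataFrame. The sample rows and the name `toy` are made up for illustration; only the column names and placeholder strings come from the notebook. Passing a column-to-value mapping to `DataFrame.fillna` is one way to avoid the chained `inplace=True` pattern, under the assumption that a single call per DataFrame is acceptable here.

```python
import pandas as pd

# Hypothetical sample rows standing in for metadata.csv
toy = pd.DataFrame({
    "title": ["Paper A", None, "Paper C"],
    "abstract": ["Some text", "Other text", None],
    "authors": [None, "Doe, J.", "Roe, R."],
    "journal": ["Virology", None, None],
    "license": ["cc-by", None, "els-covid"],
})

# Fill several columns in one call by passing a column -> value mapping
toy = toy.fillna({
    "abstract": "No abstract available",
    "authors": "Unknown",
    "journal": "Unknown",
    "license": "Unknown",
})

# Drop rows that still lack a title, then renumber the remaining rows
toy = toy.dropna(subset=["title"]).reset_index(drop=True)

print(toy)
```

The `title` column is deliberately left out of the mapping, so the row without a title survives the fill step and is removed by `dropna`.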

In [14]:
#prepare and perform basic data analysis
try:
    df = pd.read_csv("cleaned_metadata.csv")
    print("--- File Loaded Successfully ---\n")
except FileNotFoundError:
    print("File not found. Please check the file path and try again.")
    # Re-raise so the cell stops here instead of failing later with a NameError
    raise

# --- PREPARE THE DATA ---
# Convert 'publish_time' to datetime format
df['publish_time'] = pd.to_datetime(df['publish_time'], errors='coerce')

# Create 'publish_year' by extracting the year
# We drop any rows where the date was invalid to prevent errors
df.dropna(subset=['publish_time'], inplace=True)
df['publish_year'] = df['publish_time'].dt.year.astype(int)
print("--- Data Preparation Complete (created 'publish_year' column) ---\n")


# ---PERFORM BASIC DATA ANALYSIS ---

# --- Count Papers by Publication Year ---
print("--- Publications per Year ---")
papers_per_year = df['publish_year'].value_counts().sort_index()
print(papers_per_year)
print("-" * 35)


# --- Identify Top Journals ---
print("\n--- Top 10 Journals Publishing Research ---")
# Exclude the 'Unknown' category from the count
top_journals = df[df['journal'] != 'Unknown']['journal'].value_counts().head(10)
print(top_journals)
print("-" * 35)


# --- Find Most Frequent Words in Titles ---
print("\n--- Top 20 Most Frequent Words in Titles ---")
# Define a small set of common English stop words plus boilerplate terms seen in titles
stop_words = set([
    'doi', 'preprint', 'copyright', 'peer', 'reviewed', 'org', 'http', 'https', 'et', 'al', 'author', 'figure',
    'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'of', 'at', 'by', 'for', 'with', 'about',
    'to', 'in', 'on', 'is', 'are', 'was', 'were', 'it', 'that', 'which', 'who', 'what', 'when', 'where',
    'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',
    'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now',
    'from', '-PRON-', 'using', 'study', 'we', 'our', 'results', 'data', 'analysis', 'based'
])
# Combine all titles into one large text block
all_titles = ' '.join(df['title'].astype(str).tolist())
# Find all words, convert to lowercase, and filter out stop words
words = [word for word in re.findall(r'\b\w+\b', all_titles.lower()) if word not in stop_words and len(word) > 2]
# Count the frequency of the remaining words
word_counts = Counter(words)
# Print the 20 most common words
print(word_counts.most_common(20))
print("-" * 35)

--- File Loaded Successfully ---

--- Data Preparation Complete (created 'publish_year' column) ---

--- Publications per Year ---
publish_year
1951       1
1952       1
1955       1
1957       1
1959       1
        ... 
2016    2965
2017    2911
2018    3094
2019    3144
2020    2953
Name: count, Length: 65, dtype: int64
-----------------------------------

--- Top 10 Journals Publishing Research ---
journal
Journal of Virology        1740
PLoS One                   1567
Virology                    864
Emerg Infect Dis            745
The Lancet                  596
Viruses                     565
Virus Research              495
Sci Rep                     491
Vaccine                     483
Veterinary Microbiology     443
Name: count, dtype: int64
-----------------------------------

--- Top 20 Most Frequent Words in Titles ---
[('virus', 8018), ('respiratory', 4897), ('coronavirus', 4590), ('infection', 3533), ('viral', 2892), ('human', 2756), ('protein', 2717), ('influenza', 2551),

## 2. Data Preparation and Analysis

### Objective
This section prepares the loaded `cleaned_metadata.csv` file for analysis and then performs three basic analyses: counting publications per year, identifying the top journals, and finding the most frequent words in paper titles.

### Part A: Data Preparation
The following preparation steps are performed on the DataFrame:
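* The `publish_time` column is converted to a proper datetime format; values that cannot be parsed become missing (`NaT`).
* Rows with a missing or invalid `publish_time` are dropped.
* A `publish_year` column is created by extracting the year from `publish_time` to enable time-based analysis.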
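
As a minimal sketch of these preparation steps, the example below runs the same conversion on three made-up date strings, one of which is deliberately invalid. The DataFrame name `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical date strings; the last one is deliberately invalid
toy = pd.DataFrame({"publish_time": ["2020-03-15", "2019-01-01", "not a date"]})

# Unparseable values become NaT instead of raising an error
toy["publish_time"] = pd.to_datetime(toy["publish_time"], errors="coerce")

# Drop the NaT rows, then extract the year as an integer
toy = toy.dropna(subset=["publish_time"])
toy["publish_year"] = toy["publish_time"].dt.year.astype(int)

print(toy["publish_year"].tolist())
```

Dropping the `NaT` rows before calling `astype(int)` matters, because a column that still contains missing values cannot be cast to a plain integer type.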

---

### Part B: Basic Data Analysis
The script then performs and prints the results for the following three analyses:

* **Publications per Year:** Counts the total number of papers for each year.
* **Top 10 Journals:** Identifies the 10 journals with the highest number of publications, excluding rows whose journal is the "Unknown" placeholder.
* **Top 20 Words in Titles:** Extracts and counts the most common words found in paper titles after removing common stop words.
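
The title word count can be illustrated with the standard library alone. The two titles and the tiny stop-word set below are invented for the example; the tokenizing pattern and the length filter mirror the ones used in the notebook cell.

```python
import re
from collections import Counter

# Hypothetical titles and a deliberately tiny stop-word set
titles = [
    "Coronavirus infection in the human respiratory tract",
    "A study of respiratory virus infection",
]
stop_words = {"a", "in", "of", "the", "study"}

# Lowercase, split into word tokens, drop stop words and very short tokens
tokens = [
    word
    for word in re.findall(r"\b\w+\b", " ".join(titles).lower())
    if word not in stop_words and len(word) > 2
]

word_counts = Counter(tokens)
print(word_counts.most_common(2))
```

Because the text is lowercased before matching, stop words only need to be listed in lowercase.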

In [None]:
#VISUALIZE THE DATA
# ---  Plot 1: Number of Publications Over Time ---
plt.figure(figsize=(10, 6))
papers_per_year = df['publish_year'].value_counts().sort_index()
# Optional: uncomment to restrict the plot to a recent year range
# papers_per_year = papers_per_year.loc[2015:2021]
sns.lineplot(x=papers_per_year.index, y=papers_per_year.values, marker='o')
plt.title('Number of Publications Over Time', fontsize=16)
plt.xlabel('Year')
plt.ylabel('Number of Papers')
plt.grid(True)
plt.savefig('publications_over_time.png')
plt.close()
print("--- Generated: publications_over_time.png ---")


# ---  Plot 2: Top Publishing Journals ---
plt.figure(figsize=(10, 8))
top_journals = df[df['journal'] != 'Unknown']['journal'].value_counts().head(10)
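# Assign the y variable to hue so that palette is applied without a deprecation warning
sns.barplot(x=top_journals.values, y=top_journals.index, hue=top_journals.index, palette='viridis', legend=False)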
plt.title('Top 10 Publishing Journals', fontsize=16)
plt.xlabel('Number of Papers')
plt.ylabel('Journal')
plt.tight_layout()
plt.savefig('top_journals.png')
plt.close()
print("--- Generated: top_journals.png ---")


# ---  Plot 3: Word Cloud of Titles ---
# Add domain-specific stop words
custom_stop_words = set(STOPWORDS) | set(['preprint', 'doi', 'copyright', 'peer', 'reviewed', 'author'])
# Combine all titles into a single string
all_titles = ' '.join(df['title'].astype(str).tolist())
# Generate word cloud
wordcloud = WordCloud(
    width=800,
    height=400,
    background_color='white',
    stopwords=custom_stop_words,
    colormap='plasma'
).generate(all_titles)
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Paper Titles', fontsize=16)
plt.savefig('titles_wordcloud.png')
plt.close()
print("--- Generated: titles_wordcloud.png ---")


# ---  Plot 4: Distribution by Source ---
plt.figure(figsize=(10, 6))
source_counts = df['source_x'].value_counts()
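# Assign the x variable to hue so that palette is applied without a deprecation warning
sns.barplot(x=source_counts.index, y=source_counts.values, hue=source_counts.index, palette='crest', legend=False)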
plt.title('Distribution of Papers by Source', fontsize=16)
plt.xlabel('Source')
plt.ylabel('Number of Papers')
plt.savefig('papers_by_source.png')
plt.close()
print("--- Generated: papers_by_source.png ---")

--- Generated: publications_over_time.png ---





--- Generated: top_journals.png ---
--- Generated: titles_wordcloud.png ---
--- Generated: papers_by_source.png ---





# Data Visualization of Publication Dataset

This report provides an overview of publication patterns, top journals, common research themes, and source distribution.  

---

## 1. Number of Publications Over Time  
**File:** `publications_over_time.png`

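This plot displays the **number of papers published per year**, covering 65 distinct publication years starting in 1951.  
Key observations:
- Counts rise from a single paper in several years of the 1950s to roughly 2,900 to 3,100 papers per year between 2016 and 2020.
- Peaks and drops can be compared against global events or advances in the field.
- The long-term increase suggests growing interest and research output.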
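
Because the earliest years hold only one paper each, it can help to restrict the series to a recent window before plotting. The sketch below shows label-based slicing on made-up years; the values and the chosen window are assumptions, not figures from the dataset.

```python
import pandas as pd

# Hypothetical publication years, not taken from the dataset
years = pd.Series([1998, 2015, 2016, 2016, 2019, 2020, 2020, 2020], name="publish_year")

# Count papers per year and sort chronologically
papers_per_year = years.value_counts().sort_index()

# Label-based slicing on the sorted index keeps both endpoints
recent = papers_per_year.loc[2015:2020]

print(recent)
```

The index must be sorted first, which is why `sort_index()` comes before the `.loc` slice.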

---

## 2. Top Publishing Journals  
**File:** `top_journals.png`

This chart highlights the **top 10 journals** by the number of papers published.  
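Key observations:
- Shows the most prolific journals in the dataset; paper counts alone do not measure influence.
- *Journal of Virology* (1,740 papers) and *PLoS One* (1,567) lead, well ahead of *Virology* (864) in third place.
- Useful for researchers when selecting publication venues, since it indicates where many of the studies are being disseminated.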

---

## 3. Word Cloud of Paper Titles  
**File:** `titles_wordcloud.png`

The word cloud illustrates the **most frequently occurring keywords in paper titles**.  
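Key observations:
- Larger words indicate higher frequency.
- Offers insight into **dominant research themes and focus areas**; in the title word counts from Section 2, `virus`, `respiratory`, and `coronavirus` are the three most frequent terms.
- The cloud uses the `wordcloud` package's built-in `STOPWORDS` plus a few custom terms, so its ranking can differ slightly from the word counts in Section 2.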

---

## 4. Distribution by Source  
**File:** `papers_by_source.png`

This bar chart shows the **distribution of papers across different sources**.  
Key observations:
- Highlights which data sources contribute the most publications.
- Useful for understanding data diversity and coverage.
- Helps prioritize sources for future research or data collection.
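
One way to quantify how concentrated the sources are is a cumulative share, sketched below. The source labels and their counts are invented for illustration and do not reflect the real distribution in `source_x`.

```python
import pandas as pd

# Hypothetical source labels, not the real distribution
sources = pd.Series(
    ["PMC", "PMC", "PMC", "Elsevier", "Elsevier", "biorxiv"], name="source_x"
)

# Share of papers per source, largest first
share = sources.value_counts(normalize=True)

# Running total shows how many sources are needed to cover most papers
cumulative_share = share.cumsum()

print(cumulative_share.round(2))
```

Reading down the cumulative column shows how few sources are needed to reach a given fraction of all papers.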

---

## Overall Insights
- Research output has **grown over time**, from isolated papers in the 1950s to about 3,000 papers per year in the late 2010s.  
- A few journals account for a large number of papers, led by *Journal of Virology* (1,740) and *PLoS One* (1,567).  
- The most common title words (`virus`, `respiratory`, `coronavirus`, `infection`) reflect the **core themes** of the dataset.  
- Data sources vary widely, but a handful contribute **most of the available research papers**.
