# 📚 AI & ML Research Paper and Book Recommender Dataset Project

---

## 🔹 Project Overview
The purpose of this project is to **create a comprehensive dataset of research papers and books** in the fields of **Machine Learning (ML), Deep Learning (DL), Artificial Intelligence (AI), Natural Language Processing (NLP), Data Science, and related areas**.  
This dataset will later serve as the foundation for building a **recommendation system** that can suggest highly relevant and impactful resources to learners, researchers, and professionals.

---

## 🔹 Motivation
With the rapid growth of AI/ML research and publications, it is increasingly difficult for students, researchers, and practitioners to **identify high-quality and relevant resources** efficiently.  
A curated dataset of books and research papers enables:  
- Quick discovery of important research papers and books  
- Recommendations based on relevance, citations, recency, and topic  
- Support for personalized learning and knowledge expansion  

---

## 🔹 Project Goals
1. **Fetch books** related to ML, DL, AI, NLP, and Data Science from the **Google Books API**.  
2. **Fetch research papers** in the same domains from the **Semantic Scholar API**.  
3. **Store data** in both **raw JSON** and **refined CSV** formats for easy analysis and use.  
4. **Include key information**:  
   - Title, Abstract, Authors, Publication Year, Citations, Venue/Publisher, URL  
5. **Prepare the dataset** for future use in building a **recommendation system** combining books and research papers.  

---

## 🔹 Key Features of the Dataset
- Covers multiple **ML/AI/NLP categories** with a wide range of resources  
- Includes **academic impact signals** (citation counts for papers, ratings for books)  
- Handles **duplicates and filtering** to ensure quality  
- Easy to use for **data analysis, visualization, and recommender systems**  

---

## 🔹 Tools & APIs Used
- **Python** for data fetching, processing, and storage  
- **Google Books API** to fetch books data  
- **Semantic Scholar API** to fetch research papers data  
- **Pandas** for refining and saving datasets  
- **JSON & CSV** for storing data in structured and human-readable formats  

---

## 🔹 Future Scope
- Merge book and paper datasets for a **unified recommendation engine**  
- Rank resources by **relevance, citations, recency, and ratings**  
- Build **personalized ML/NLP learning recommendations**  
- Expand dataset with **more topics and research domains**  

---

## 🔹 Notebook Structure
1. **Introduction & Motivation**  
2. **Setup & Library Imports**  
3. **Define Functions to Fetch & Save Data**  
4. **Define Queries for Books & Papers**  
5. **Fetch Data from APIs**  
6. **Save Raw JSON & Refined CSV**  
7. **Preview & Analyze Dataset**  
8. **Next Steps for Recommender System**


# Importing the Required Tools

In [None]:
# Fetching data from the APIs
import requests
import json
import time


# Data Exploration 
import pandas as pd




# Loading the Data

## Loading Book Data from the Google Books API

In [None]:
# API_KEY="My_API_KEY"
# queries = [
#     'intitle:"machine learning"',
#     'intitle:"deep learning"',
#     'intitle:"artificial intelligence"',
#     'intitle:"natural language processing"',
#     'intitle:"data science"',
#     'intitle:"computer vision"',
#     'intitle:"reinforcement learning"',
#     'intitle:"ML algorithms"',
#     'intitle:"AI research"',
#     'intitle:"neural networks"'
# ]

# all_books = []

# for q in queries:
#     for start in range(0,400,40):
#         url = f"https://www.googleapis.com/books/v1/volumes?q={q}&maxResults=40&startIndex={start}&key={API_KEY}"
#         response=requests.get(url)
#         data=response.json()
#         items=data.get("items",[])
#         all_books.extend(items)
#         time.sleep(0.8)
        

# with open("data/Ml_books.json","w",encoding="utf-8") as f:
#     json.dump(all_books,f,ensure_ascii=False,indent=4)
    
# print(f"Fetched {len(all_books)} books and saved to data/Ml_books.json")

Fetched 1335 books and saved to data/Ml_books.json
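
Because the ten queries overlap (one title can match both "machine learning" and "neural networks"), the same volume may appear more than once in `all_books`. The following is a minimal sketch of one way to drop such repeats, assuming each item carries the top-level `id` field returned by the Volumes API. The helper name `dedupe_by_id` and the sample records are hypothetical and not part of the original notebook.

```python
def dedupe_by_id(items):
    """Keep the first occurrence of each volume, keyed on its "id" field."""
    seen = set()
    unique = []
    for item in items:
        volume_id = item.get("id")
        # Items without an id cannot be compared, so they are always kept.
        if volume_id is None or volume_id not in seen:
            unique.append(item)
            if volume_id is not None:
                seen.add(volume_id)
    return unique


# Toy records standing in for entries of all_books (ids are made up).
sample = [
    {"id": "a1", "volumeInfo": {"title": "Book A"}},
    {"id": "b2", "volumeInfo": {"title": "Book B"}},
    {"id": "a1", "volumeInfo": {"title": "Book A"}},
]
unique_books = dedupe_by_id(sample)
print(len(sample), "->", len(unique_books))  # 3 -> 2
```

In the notebook this could be applied as `all_books = dedupe_by_id(all_books)` before writing the JSON file.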


In [None]:
# # Loading the saved JSON file
# with open("data/Ml_books.json","r",encoding="utf-8") as f:
#     all_books=json.load(f)

# # Relevant keywords
# ml_keywords = [
#     "machine learning",
#     "deep learning",
#     "artificial intelligence",
#     "natural language processing",
#     "nlp",
#     "data science",
#     "computer vision",
#     "reinforcement learning",
#     "ml algorithms",
#     "ai research",
#     "neural networks"
# ]

# # Refining the data with relevant keywords
# filtered_books = []
# for item in all_books:
#     title = item["volumeInfo"].get("title", "").lower()
#     desc = item["volumeInfo"].get("description", "").lower()
#     if any(k in title or k in desc for k in ml_keywords):
#         filtered_books.append(item)


In [None]:
# # Saving the CSV dataset with the relevant features
# books_list = []
# for item in filtered_books:
#     info = item["volumeInfo"]
#     books_list.append({
#         "title": info.get("title"),
#         "authors": ", ".join(info.get("authors", [])),
#         "description": info.get("description", ""),
#         "categories": ", ".join(info.get("categories", [])),
#         "publisher": info.get("publisher", []),
#         "publishedDate": info.get("publishedDate", ""),
#         "avgrating": info.get("averageRating", 0),
#         "pagecount": info.get("pageCount", 0),
#     })

# df = pd.DataFrame(books_list)
# df.to_csv("data/ml_books.csv", index=False)
# print("Saved filtered books to data/ml_books.csv")


Saved filtered books to data/ml_books.csv


## Loading Research Papers from the Semantic Scholar API

In [None]:
# # Function to fetch the papers
# def fetch_papers(query, limit=100, max_results=300):
#     """
#     Fetch research papers from Semantic Scholar API.
#     query: search keyword (e.g., "machine learning")
#     limit: results per API call (max 100)
#     max_results: total number of results to fetch
#     """
#     papers = []
#     base_url = "https://api.semanticscholar.org/graph/v1/paper/search"
#     fields = "title,abstract,authors,url,year,citationCount,venue"

#     for offset in range(0, max_results, limit):
#         url = f"{base_url}?query={query}&limit={limit}&offset={offset}&fields={fields}"
#         response = requests.get(url)

#         if response.status_code != 200:
#             print("Error fetching", query, ":", response.status_code)
#             break

#         data = response.json()
#         items = data.get("data", [])
#         if not items:
#             break

#         # Attach query label for reference
#         for item in items:
#             item["searchQuery"] = query
#         papers.extend(items)

#         time.sleep(1)  # avoid hitting API too fast

#     return papers
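
`fetch_papers` stops a query as soon as it sees a non-200 status, so a single HTTP 429 (too many requests) ends that query early. One common alternative is to wait and retry with exponential backoff. The sketch below illustrates the idea under stated assumptions: the helper `get_with_backoff`, its parameters, and the `FakeResponse` class are hypothetical, and the retry count and delays are illustrative values rather than limits documented by the API.

```python
import time


def get_with_backoff(fetch, max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it returns a non-429 response or retries run out.

    fetch: zero-argument callable returning an object with .status_code
    sleep: injected so the waiting can be stubbed out when testing
    """
    response = fetch()
    for attempt in range(max_retries):
        if response.status_code != 429:
            break
        # Wait base_delay, then 2x, 4x, ... before trying again.
        sleep(base_delay * (2 ** attempt))
        response = fetch()
    return response


class FakeResponse:
    """Stand-in for a requests.Response, used only for this demo."""

    def __init__(self, status_code):
        self.status_code = status_code


# Simulate two rate-limited replies followed by a success.
statuses = iter([429, 429, 200])
delays = []
result = get_with_backoff(lambda: FakeResponse(next(statuses)), sleep=delays.append)
print(result.status_code, delays)  # 200 [2.0, 4.0]
```

Inside `fetch_papers`, the call `requests.get(url)` could be wrapped as `get_with_backoff(lambda: requests.get(url))`; the existing status check would then only trigger after all retries are exhausted.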


In [None]:
# #  Function to save results
# def save_papers(papers, json_file="papers.json", csv_file="papers.csv"):
#     """
#     Save papers to JSON (raw) and CSV (refined).
#     """
#     # Save raw JSON
#     with open(json_file, "w", encoding="utf-8") as f:
#         json.dump(papers, f, indent=4, ensure_ascii=False)

#     # Refine for CSV
#     refined = []
#     for p in papers:
#         refined.append({
#             "SearchQuery": p.get("searchQuery", ""),
#             "Title": p.get("title", ""),
#             "Abstract": p.get("abstract", ""),
#             "Authors": ", ".join([a.get("name", "") for a in p.get("authors", [])]),
#             "Year": p.get("year", ""),
#             "Citations": p.get("citationCount", 0),
#             "Venue": p.get("venue", ""),
#             "URL": p.get("url", "")
#         })

#     df = pd.DataFrame(refined)
#     df.to_csv(csv_file, index=False, encoding="utf-8")
#     print(f"Saved {len(refined)} papers → {json_file}, {csv_file}")


In [None]:
# # Loading the papers with Relevant key words
# queries = [
#     "machine learning",
#     "deep learning",
#     "artificial intelligence",
#     "natural language processing",
#     "nlp",
#     "computer vision",
#     "reinforcement learning",
#     "data science"
# ]

# all_papers = []
# for q in queries:
#     print(f"Fetching papers for: {q}")
#     papers = fetch_papers(q, limit=100, max_results=300) 
#     all_papers.extend(papers)

# save_papers(all_papers, json_file="data/all_papers.json", csv_file="data/all_papers.csv")


Fetching papers for: machine learning
Error fetching machine learning : 429
Fetching papers for: deep learning
Error fetching deep learning : 429
Fetching papers for: artificial intelligence
Error fetching artificial intelligence : 429
Fetching papers for: natural language processing
Fetching papers for: nlp
Error fetching nlp : 429
Fetching papers for: computer vision
Error fetching computer vision : 429
Fetching papers for: reinforcement learning
Error fetching reinforcement learning : 429
Fetching papers for: data science
Error fetching data science : 429
Saved 600 papers → data/all_papers.json, data/all_papers.csv


# 🔍 Data Inspection & Preview

---

## Purpose
Before using the dataset for analysis or building a recommendation system, it is important to **explore and inspect the data** to ensure:  
- Data was fetched correctly from the APIs  
- All key fields (Title, Abstract, Authors, Year, Citations, URL) are present  
- There are no duplicates or missing values  
- The dataset covers all intended categories  
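
The duplicate and category-coverage checks in this list can be sketched with pandas as follows. This is an illustration on a toy frame that reuses the `SearchQuery` and `Title` column names from `data/all_papers.csv`; the rows are made up, and matching on exact titles is an assumption about what counts as a duplicate.

```python
import pandas as pd

# Toy frame with the same column names as the papers CSV (values are made up).
df = pd.DataFrame({
    "SearchQuery": ["nlp", "nlp", "deep learning", "nlp"],
    "Title": ["Paper A", "Paper B", "Paper A", "Paper C"],
})

# Rows whose Title already appeared earlier in the frame.
n_duplicate_titles = int(df.duplicated(subset="Title").sum())

# Number of rows collected per search query, to spot missing categories.
rows_per_query = df["SearchQuery"].value_counts().to_dict()

print(n_duplicate_titles)
print(rows_per_query)
```

A query that is absent from `rows_per_query`, or has far fewer rows than the others, indicates a category that was not fetched completely.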

In [38]:
# BOOKS DataFrame
df_books=pd.read_csv("data/ml_books.csv")
df_books.head()

|  | title | authors | description | categories | publisher | publishedDate | avgrating | pagecount |
|---|---|---|---|---|---|---|---|---|
| 0 | Python Machine Learning | Sebastian Raschka, Vahid Mirjalili | Applied machine learning with a solid foundati... | Computers | Packt Publishing Ltd | 2019-12-12 | 0.0 | 771 |
| 1 | Introduction to Machine Learning | Ethem Alpaydin | An introductory text in machine learning that ... | Computers | MIT Press | 2004 | 4.0 | 468 |
| 2 | Understanding Machine Learning | Shai Shalev-Shwartz, Shai Ben-David | Introduces machine learning and its algorithmi... | Computers | Cambridge University Press | 2014-05-19 | 5.0 | 415 |
| 3 | Introduction to Machine Learning, fourth edition | Ethem Alpaydin | A substantially revised fourth edition of a co... | Computers | MIT Press | 2020-03-24 | 0.0 | 709 |
| 4 | Hands-On Machine Learning with Scikit-Learn, K... | Aurélien Géron | Through a series of recent breakthroughs, deep... | Computers | O'Reilly Media | 2019-09-05 | 0.0 | 851 |


In [39]:
# PAPER DataFrame
df_paper=pd.read_csv("data/all_papers.csv")
df_paper.head()

|  | SearchQuery | Title | Abstract | Authors | Year | Citations | Venue | URL |
|---|---|---|---|---|---|---|---|---|
| 0 | artificial intelligence | Peeking Inside the Black-Box: A Survey on Expl... | At the dawn of the fourth industrial revolutio... | Amina Adadi, M. Berrada | 2018.0 | 4078 | IEEE Access | https://www.semanticscholar.org/paper/21dff47a... |
| 1 | artificial intelligence | High-performance medicine: the convergence of ... |  | E. Topol | 2019.0 | 4910 | Nature Network Boston | https://www.semanticscholar.org/paper/f134abea... |
| 2 | artificial intelligence | Sparks of Artificial General Intelligence: Ear... | Artificial intelligence (AI) researchers have ... | Sébastien Bubeck, Varun Chandrasekaran, Ronen ... | 2023.0 | 3388 | arXiv.org | https://www.semanticscholar.org/paper/8dbd5746... |
| 3 | artificial intelligence | Explainable Artificial Intelligence (XAI): Con... |  | Alejandro Barredo Arrieta, Natalia Díaz Rodríg... | 2019.0 | 6788 | Information Fusion | https://www.semanticscholar.org/paper/530a059c... |
| 4 | artificial intelligence | Explanation in Artificial Intelligence: Insigh... |  | Tim Miller | 2017.0 | 4511 | Artificial Intelligence | https://www.semanticscholar.org/paper/e89dfa30... |


### DataFrame Information

In [43]:
df_books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296 entries, 0 to 1295
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   title          1296 non-null   object 
 1   authors        1273 non-null   object 
 2   description    1180 non-null   object 
 3   categories     1187 non-null   object 
 4   publisher      1296 non-null   object 
 5   publishedDate  1283 non-null   object 
 6   avgrating      1296 non-null   float64
 7   pagecount      1296 non-null   int64  
dtypes: float64(1), int64(1), object(6)
memory usage: 81.1+ KB


In [41]:
df_paper.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   SearchQuery  600 non-null    object 
 1   Title        600 non-null    object 
 2   Abstract     381 non-null    object 
 3   Authors      595 non-null    object 
 4   Year         596 non-null    float64
 5   Citations    600 non-null    int64  
 6   Venue        549 non-null    object 
 7   URL          600 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 37.6+ KB


In [53]:

# Combine numeric descriptions side by side
desc_books = df_books.describe()
desc_papers = df_paper.describe()

combined_desc = pd.concat([desc_books, desc_papers], axis=1, keys=["Books", "Papers"])
display(combined_desc)

|  | Books: avgrating | Books: pagecount | Papers: Year | Papers: Citations |
|---|---|---|---|---|
| count | 1296.0 | 1296.0 | 596.0 | 600.0 |
| mean | 0.34375 | 366.627315 | 2018.90604 | 1052.273333 |
| std | 1.209872 | 288.636565 | 5.232211 | 4890.785858 |
| min | 0.0 | 0.0 | 1980.0 | 0.0 |
| 25% | 0.0 | 203.0 | 2018.0 | 116.75 |
| 50% | 0.0 | 318.0 | 2020.0 | 267.5 |
| 75% | 0.0 | 479.25 | 2022.0 | 590.5 |
| max | 5.0 | 3296.0 | 2025.0 | 99369.0 |


### Checking for Null Values

In [59]:
df_books.isna().sum()

title              0
authors           23
description      116
categories       109
publisher          0
publishedDate     13
avgrating          0
pagecount          0
dtype: int64

In [60]:
df_paper.isna().sum()

SearchQuery      0
Title            0
Abstract       219
Authors          5
Year             4
Citations        0
Venue           51
URL              0
dtype: int64