# **Aerial Image Analysis Research Papers**

<center>
<img src="https://imgur.com/f5L9s74.jpg" width="800" height="400" >
</center>

Researchers need a platform where they can freely access the literature of many fields. Google Scholar is one of the best freely accessible search engines for this purpose: it indexes a wide variety of published literature, including articles and research papers.

But too much choice can lead to confusion. **Web scraping** is a way to collect information from a website in a structured form, filtered by our interests.

In this project, we collect information about research papers related to aerial image analysis. The implementation uses the Python libraries `requests`, Beautiful Soup, and pandas.

We will collect the following information:
1. Title of the paper
2. Number of citations
3. First author of the paper
4. Year of publication
5. Place of publication

We will store this information in a dictionary. The dataset is then saved as a table in CSV format and can be helpful for a literature survey in this area.

In [2]:
!pip install jovian --upgrade --quiet

# Libraries 


In [1]:
# import the libraries
import requests
from time import sleep 
import re
import pandas as pd
from bs4 import BeautifulSoup

# Define Header 

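Here we define a request header with a browser-like `User-Agent`. Some sites reject requests that carry the default User-Agent of a script, so this header makes the request look as if it comes from a regular browser.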

In [2]:
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 5.1.1; SM-G928X Build/LMY47X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.83 Mobile Safari/537.36'}
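
As a minimal sketch, the header could also be attached once to a `requests.Session` so that every request made through it sends the same `User-Agent`. The URL `https://example.org/` and the shortened User-Agent string below are placeholders, not values from this project. Nothing is sent over the network; the request is only prepared and inspected.

```python
import requests

# Placeholder header and URL, used only for illustration.
demo_headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}

session = requests.Session()
session.headers.update(demo_headers)  # merged into every request made via this session

# Build the request without sending it, then inspect the outgoing headers.
prepared = session.prepare_request(requests.Request('GET', 'https://example.org/'))
print(prepared.headers['User-Agent'])
```

Inspecting the prepared request is a cheap way to confirm what the server will see before any page is fetched.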

# Define Functions



In [3]:
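def get_paperinfo(paper_url):
    # download the page, sending the browser-like headers defined above
    response = requests.get(paper_url, headers=headers)

    # any status other than 200 (OK) means the page was not delivered
    if response.status_code != 200:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page')

    # parse the HTML so that its tags can be searched
    paper_doc = BeautifulSoup(response.text, 'html.parser')

    return paper_doc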
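
`get_paperinfo` gives up on the first non-200 response. One possible refinement, sketched here under assumptions, is to retry a few times with a growing delay when the server answers 429 (too many requests). The helper name `fetch_with_retry` and its parameters are hypothetical; `fetch` is any callable that returns an object with a `status_code` attribute, and `sleep` is injectable so the logic can be tried without waiting.

```python
import time

def fetch_with_retry(fetch, url, max_tries=3, base_delay=30, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or max_tries (>= 1) is used up.

    fetch could be, for example:
        lambda u: requests.get(u, headers=headers)
    """
    for attempt in range(max_tries):
        response = fetch(url)
        if response.status_code == 200:
            return response
        # Only 429 is worth retrying; stop on other errors or on the last attempt.
        if response.status_code != 429 or attempt == max_tries - 1:
            break
        sleep(base_delay * 2 ** attempt)  # wait 30 s, then 60 s, then 120 s, ...
    raise Exception('Failed to fetch web page, status code: {}'.format(response.status_code))
```

Doubling the delay keeps the total number of requests low while still recovering from a temporary rate limit.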

In [5]:
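def get_tags(doc):
    # each search result carries a data-lid attribute
    paper_tag = doc.select('[data-lid]')
    # the link right after the "Cite" button holds the "Cited by N" text
    cite_tag = doc.select('[title=Cite] + a')
    # the title heading contains the link to the paper
    link_tag = doc.find_all('h3', {"class": "gs_rt"})
    # the line with authors, venue, year and hosting site
    author_tag = doc.find_all("div", {"class": "gs_a"})

    return paper_tag, cite_tag, link_tag, author_tag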
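
To see what these four selectors pick up, they can be tried on a small hand-written HTML fragment. The fragment below only imitates the structure that the selectors assume; it is not real Google Scholar output, and the live markup may differ or change over time.

```python
from bs4 import BeautifulSoup

# Hand-written HTML that imitates one search result (an assumption, not real output).
sample_html = '''
<div class="gs_r" data-lid="abc123">
  <h3 class="gs_rt"><a href="https://example.org/paper">A sample paper title</a></h3>
  <div class="gs_a">A Author, B Author - Some Journal, 2020 - example.org</div>
  <div class="gs_fl"><a title="Cite">Cite</a> <a href="#">Cited by 12</a></div>
</div>
'''

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# The same selectors as in get_tags, applied to the sample document.
sample_papers = sample_doc.select('[data-lid]')
sample_cites = sample_doc.select('[title=Cite] + a')
sample_links = sample_doc.find_all('h3', {'class': 'gs_rt'})
sample_authors = sample_doc.find_all('div', {'class': 'gs_a'})

print(sample_papers[0].select('h3')[0].get_text())
print(sample_cites[0].text)
print(sample_links[0].a['href'])
print(sample_authors[0].text)
```

The selector `[title=Cite] + a` means "the `<a>` element that directly follows an element whose `title` is `Cite`", which is why it lands on the citation link rather than on the button itself.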


In [7]:
def get_papertitle(paper_tag):
    paper_names = []

    # the title is the text of the first <h3> inside each result
    for tag in paper_tag:
        paper_names.append(tag.select('h3')[0].get_text())

    return paper_names

In [8]:
# return the number of citations of each paper
def get_citecount(cite_tag):
  cite_count = []
  for i in cite_tag:
    cite = i.text
    # extract the integer from text such as "Cited by 624"
    tmp = re.search(r'\d+', cite) if cite else None
    if tmp is None:  # no number found: treat the paper as having 0 citations
      cite_count.append(0)
    else:
      cite_count.append(int(tmp.group()))

  return cite_count
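
The core of `get_citecount` is the regular expression `\d+`, which picks the first run of digits out of the link text. A minimal sketch on plain strings, with a hypothetical helper name `parse_cite_count` and made-up labels:

```python
import re

def parse_cite_count(text):
    """Return the first integer found in text, or 0 if there is none."""
    match = re.search(r'\d+', text or '')
    return int(match.group()) if match else 0

# Made-up labels: one with a count, one without digits, one empty.
for label in ['Cited by 624', 'Related articles', '']:
    print(repr(label), '->', parse_cite_count(label))
```

A label without digits yields 0, which matches the fallback used in `get_citecount`.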

In [9]:
# return the link of each paper
def get_link(link_tag):

  links = []

  for tag in link_tag:
    links.append(tag.a['href'])

  return links
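
`get_link` assumes that every title heading contains an `<a>` element. If a heading has none, `.a` is `None` and indexing it raises a `TypeError`. A defensive variant could look like the sketch below; the name `get_link_safe` and the sample HTML are hypothetical and only illustrate the idea.

```python
from bs4 import BeautifulSoup

# Hand-written HTML: the second heading has no <a> element.
sample_headings_html = '''
<h3 class="gs_rt"><a href="https://example.org/paper">Linked title</a></h3>
<h3 class="gs_rt">Title without a link</h3>
'''

def get_link_safe(link_tag):
    """Return one entry per heading: the href, or None when there is no link."""
    links = []
    for tag in link_tag:
        anchor = tag.find('a')
        links.append(anchor.get('href') if anchor is not None else None)
    return links

sample_headings = BeautifulSoup(sample_headings_html, 'html.parser').find_all('h3', {'class': 'gs_rt'})
print(get_link_safe(sample_headings))
```

Returning `None` instead of skipping the entry keeps the list aligned with the titles, so each link still belongs to the right paper.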

In [10]:
# return the year, publication source and first author of each paper
def get_author_year_publi_info(authors_tag):
  years = []
  publication = []
  authors = []
  for tag in authors_tag:
      authortag_text = tag.text.split()
      # the first run of digits in the text is taken as the year
      year = int(re.search(r'\d+', tag.text).group())
      years.append(year)
      # the last token is the hosting site, e.g. ieeexplore.ieee.org
      publication.append(authortag_text[-1])
      # the first two tokens form the first author's name, e.g. "GS Xia"
      author = authortag_text[0] + ' ' + re.sub(',', '', authortag_text[1])
      authors.append(author)

  return years, publication, authors
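
The function above takes the first run of digits as the year and fails if the line contains no digits at all. A slightly stricter parse, sketched here as an alternative rather than as the notebook's method, looks for a four-digit year starting with 19 or 20 and splits the line on `' - '`. The helper name `parse_author_line` and the sample strings are made up, and the `authors - venue, year - site` layout is an assumption about the text.

```python
import re

def parse_author_line(text):
    """Return (first_author, year, site) from an 'authors - venue, year - site' line."""
    # Replace non-breaking spaces, then split into the dash-separated parts.
    parts = [p.strip() for p in text.replace('\xa0', ' ').split(' - ')]
    first_author = parts[0].split(',')[0].strip()
    site = parts[-1] if len(parts) > 1 else ''
    # Accept only a plausible publication year such as 1998 or 2020.
    match = re.search(r'\b(?:19|20)\d{2}\b', text)
    year = int(match.group()) if match else None
    return first_author, year, site

print(parse_author_line('A Author, B Author - Some Journal, 2020 - example.org'))
print(parse_author_line('C Author - A venue without a year - example.org'))
```

Returning `None` for a missing year lets the caller decide how to handle incomplete entries instead of stopping the whole scrape.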


# Store the information in a dictionary

In [11]:
# creating final repository
paper_repos_dict = {
                    'Paper Title' : [],
                    'Year' : [],
                    'Author' : [],
                    'Citation' : [],
                    'Publication' : [],
                    'Url of paper' : [] }

# add the information from one page to the repository and return it as a DataFrame
def add_in_paper_repo(papername,year,author,cite,publi,link):
  paper_repos_dict['Paper Title'].extend(papername)
  paper_repos_dict['Year'].extend(year)
  paper_repos_dict['Author'].extend(author)
  paper_repos_dict['Citation'].extend(cite)
  paper_repos_dict['Publication'].extend(publi)
  paper_repos_dict['Url of paper'].extend(link)

  return pd.DataFrame(paper_repos_dict)
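
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one selector matches fewer elements on a page than another. A small check before extending the dictionary makes such a mismatch visible early. The helper name `check_equal_lengths` is hypothetical.

```python
def check_equal_lengths(**columns):
    """Raise ValueError unless every list passed in has the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths

# Equal lengths pass; a missing citation count is reported.
print(check_equal_lengths(papername=['a', 'b'], cite=[3, 0]))
try:
    check_equal_lengths(papername=['a', 'b'], cite=[3])
except ValueError as err:
    print(err)
```

It could be called with the six per-page lists just before `add_in_paper_repo`.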

# Iterating over each page of **Google Scholar**

Here we scrape 11 pages of results (start offsets 0 to 100 in steps of 10, 110 papers in total). To scrape more pages, increase the upper bound of the range.

In [12]:
for i in range(0, 110, 10):

  # build the url of the page that starts at result i
  url = "https://scholar.google.com/scholar?start={}&q=object+detection+in+aerial+image+&hl=en&as_sdt=0,5".format(i)

  # download and parse the page
  doc = get_paperinfo(url)

  # collect the relevant tags
  paper_tag, cite_tag, link_tag, author_tag = get_tags(doc)

  # paper titles on this page
  papername = get_papertitle(paper_tag)

  # year, publication source and first author of each paper
  year, publication, author = get_author_year_publi_info(author_tag)

  # citation count of each paper
  cite = get_citecount(cite_tag)

  # url of each paper
  link = get_link(link_tag)

  # add this page to the paper repository
  final = add_in_paper_repo(papername, year, author, cite, publication, link)

  # wait between requests to avoid status code 429 (too many requests)
  sleep(30)
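
The paging relies on the `start` query parameter, which is the offset of the first result on a page. The sketch below builds the same kind of URL with `urllib.parse.urlencode` and lists the offsets produced by `range(0, 110, 10)`. The helper name `build_page_url` is hypothetical; note that `urlencode` writes the comma in `as_sdt` as `%2C`, which is an equivalent encoding.

```python
from urllib.parse import urlencode

BASE_URL = 'https://scholar.google.com/scholar'

def build_page_url(query, start):
    """Build the search URL for the results page that begins at offset start."""
    params = {'start': start, 'q': query, 'hl': 'en', 'as_sdt': '0,5'}
    return BASE_URL + '?' + urlencode(params)

offsets = list(range(0, 110, 10))
print(len(offsets), offsets[0], offsets[-1])  # 11 pages, offsets 0 through 100
print(build_page_url('object detection in aerial image', 0))
```

Building the query from a dictionary makes it easy to swap in a different search phrase without editing the URL string by hand.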


# Display of the dataset

In [13]:
len(final)

110

In [14]:
final[:10]

| | Paper Title | Year | Author | Citation | Publication | Url of paper |
|---|---|---|---|---|---|---|
| 0 | DOTA: A large-scale dataset for object detecti... | 2018 | GS Xia | 624 | openaccess.thecvf.com | http://openaccess.thecvf.com/content_cvpr_2018... |
| 1 | Convolutional neural network based automatic o... | 2016 | I Ševo | 143 | ieeexplore.ieee.org | https://ieeexplore.ieee.org/abstract/document/... |
| 2 | Orientation robust object detection in aerial ... | 2015 | H Zhu | 141 | ieeexplore.ieee.org | https://ieeexplore.ieee.org/abstract/document/... |
| 3 | Clustered object detection in aerial images | 2019 | F Yang | 54 | openaccess.thecvf.com | http://openaccess.thecvf.com/content_ICCV_2019... |
| 4 | Patch-level augmentation for object detection ... | 2019 | S Hong | 11 | openaccess.thecvf.com | http://openaccess.thecvf.com/content_ICCVW_201... |
| 5 | Axis learning for orientated objects detection... | 2020 | Z Xiao | 16 | mdpi.com | https://www.mdpi.com/663030 |
| 6 | Feature extraction by rotation-invariant matri... | 2017 | G Wang | 39 | ieeexplore.ieee.org | https://ieeexplore.ieee.org/abstract/document/... |
| 7 | Density map guided object detection in aerial ... | 2020 | C Li | 11 | openaccess.thecvf.com | http://openaccess.thecvf.com/content_CVPRW_202... |
| 8 | Adaptive anchor for fast object detection in a... | 2019 | R Jin | 6 | ieeexplore.ieee.org | https://ieeexplore.ieee.org/abstract/document/... |
| 9 | Learning roi transformer for oriented object d... | 2019 | J Ding | 129 | openaccess.thecvf.com | http://openaccess.thecvf.com/content_CVPR_2019... |


# Storing in a CSV file

In [15]:
final.to_csv('aerial_image_reserachpapers.csv', sep=',', index=False,header=True)
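
The `index=False` argument matters when the file is read back later. The sketch below writes a one-row sample DataFrame (made-up values) to an in-memory buffer with and without the index and compares the columns that `pd.read_csv` returns.

```python
import io
import pandas as pd

# One made-up row, used only to compare the two ways of writing.
sample_df = pd.DataFrame({'Paper Title': ['Sample paper'], 'Citation': [12]})

# index=False: only the data columns are written.
buf = io.StringIO()
sample_df.to_csv(buf, index=False)
buf.seek(0)
cols_without_index = list(pd.read_csv(buf).columns)

# Default (index=True): the row index is written as an unnamed first column.
buf = io.StringIO()
sample_df.to_csv(buf)
buf.seek(0)
cols_with_index = list(pd.read_csv(buf).columns)

print(cols_without_index)
print(cols_with_index)
```

With the index written, the file gains an extra column that `read_csv` labels `Unnamed: 0`; `index=False` avoids it.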

# Summary

In this project, we collected information about research papers in the area of aerial image analysis from Google Scholar (https://scholar.google.com/).

![](https://imgur.com/KYojxqI.jpg)

The metadata gathered for each paper covers its title, the year of publication, the site that hosts the journal or conference publication, the first author, the number of citations, and the link to the paper.


To scrape this information, we took the following steps:
1. Downloaded the web page with `requests` and parsed its content with the Beautiful Soup library.
2. Defined functions for:
    - Extracting the tags of the relevant parts
    - Extracting the paper title, author, year, number of citations, etc. from those tags
3. Repeated the above steps for each results page and collected data from 11 pages.
4. Stored the dataset in CSV format.

# Future Works

The current project collects papers only in the area of aerial image analysis. It could be generalized to scrape papers for any user-defined area. From the scraped data, we could visualize the research area: which papers receive the most citations, how much work is published each year, and which papers and conferences/journals rank highest. This data can also support a detailed literature survey of a specific area.
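
A first step towards such an analysis could look like the sketch below. It uses four rows copied from the dataset preview shown earlier (author, year, citation count, and hosting site) and is only meant to illustrate the kind of summary that pandas makes easy; the variable names are arbitrary.

```python
import pandas as pd

# Four rows copied from the preview of the scraped dataset.
papers = pd.DataFrame({
    'Author': ['GS Xia', 'I Ševo', 'F Yang', 'J Ding'],
    'Year': [2018, 2016, 2019, 2019],
    'Citation': [624, 143, 54, 129],
    'Publication': ['openaccess.thecvf.com', 'ieeexplore.ieee.org',
                    'openaccess.thecvf.com', 'openaccess.thecvf.com'],
})

# Most cited papers first.
top_cited = papers.sort_values('Citation', ascending=False).head(3)
# Total citations of the papers published in each year.
per_year = papers.groupby('Year')['Citation'].sum()
# Number of papers hosted on each site.
per_source = papers['Publication'].value_counts()

print(top_cited[['Author', 'Citation']])
print(per_year)
print(per_source)
```

On the full dataset, the same three calls would run on `final` instead of the small sample.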

# References

1. https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/web-scraping-and-rest-apis
2. https://docs.python-requests.org/en/master/
3. https://www.crummy.com/software/BeautifulSoup/bs4/doc/
4. https://www.tutorialspoint.com/requests/requests_web_scraping_using_requests.htm
  



In [None]:
import jovian
jovian.commit(project="dataanalyst-bootcamp-project1-web-scraping", outputs=['aerial_image_reserachpapers.csv'])
