## <center> Web-Scraping Project 6.0 </center>

This notebook scrapes the following details for the top publications listed on Google Scholar at <a>https://scholar.google.com/citations?view_op=top_venues&hl=en</a>:
<ol>
    <li>Rank</li>
    <li>Publication</li>
    <li>h5-index</li>
    <li>h5-median</li>
</ol>

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
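url = "https://scholar.google.com/citations?view_op=top_venues&hl=en"
# Fail fast on a network problem or an HTTP error instead of parsing an error page
response = requests.get(url, timeout=30)
response.raise_for_status()
gs_html_data = response.text
gs_soup = BeautifulSoup(gs_html_data, "html.parser")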

In [3]:
publication = [tags.text for tags in gs_soup.find_all("td", class_="gsc_mvt_t")]
publication

['Nature',
 'The New England Journal of Medicine',
 'Science',
 'IEEE/CVF Conference on Computer Vision and Pattern Recognition',
 'The Lancet',
 'Advanced Materials',
 'Nature Communications',
 'Cell',
 'International Conference on Learning Representations',
 'Neural Information Processing Systems',
 'JAMA',
 'Chemical Reviews',
 'Proceedings of the National Academy of Sciences',
 'Angewandte Chemie',
 'Chemical Society Reviews',
 'Journal of the American Chemical Society',
 'IEEE/CVF International Conference on Computer Vision',
 'Nucleic Acids Research',
 'International Conference on Machine Learning',
 'Nature Medicine',
 'Renewable and Sustainable Energy Reviews',
 'Science of The Total Environment',
 'Advanced Energy Materials',
 'Journal of Clinical Oncology',
 'ACS Nano',
 'Journal of Cleaner Production',
 'Advanced Functional Materials',
 'Physical Review Letters',
 'Scientific Reports',
 'The Lancet Oncology',
 'Energy & Environmental Science',
 'IEEE Access',
 'PLoS ONE',
 '

In [4]:
index = [tags.text for tags in gs_soup.select("td > a")]
index

['444',
 '432',
 '401',
 '389',
 '354',
 '312',
 '307',
 '300',
 '286',
 '278',
 '267',
 '265',
 '256',
 '245',
 '244',
 '242',
 '239',
 '238',
 '237',
 '235',
 '227',
 '225',
 '220',
 '213',
 '211',
 '211',
 '210',
 '207',
 '206',
 '202',
 '202',
 '200',
 '198',
 '197',
 '195',
 '192',
 '191',
 '190',
 '189',
 '186',
 '183',
 '181',
 '181',
 '180',
 '178',
 '177',
 '175',
 '173',
 '173',
 '173',
 '172',
 '170',
 '169',
 '167',
 '166',
 '165',
 '165',
 '165',
 '165',
 '164',
 '164',
 '163',
 '163',
 '163',
 '163',
 '162',
 '160',
 '160',
 '159',
 '159',
 '159',
 '159',
 '158',
 '158',
 '155',
 '155',
 '155',
 '155',
 '155',
 '154',
 '153',
 '153',
 '152',
 '152',
 '152',
 '152',
 '152',
 '152',
 '151',
 '151',
 '150',
 '149',
 '149',
 '146',
 '146',
 '145',
 '145',
 '145',
 '144',
 '144']

In [5]:
median = [tags.text for tags in gs_soup.select(".gsc_mvt_n > span")]
median

['667',
 '780',
 '614',
 '627',
 '635',
 '418',
 '428',
 '505',
 '533',
 '436',
 '425',
 '444',
 '364',
 '332',
 '386',
 '344',
 '415',
 '550',
 '421',
 '389',
 '324',
 '311',
 '300',
 '315',
 '277',
 '273',
 '280',
 '294',
 '274',
 '329',
 '290',
 '303',
 '278',
 '294',
 '276',
 '246',
 '297',
 '307',
 '301',
 '321',
 '253',
 '265',
 '224',
 '296',
 '220',
 '223',
 '315',
 '296',
 '228',
 '217',
 '232',
 '314',
 '304',
 '234',
 '254',
 '296',
 '293',
 '243',
 '229',
 '231',
 '207',
 '302',
 '265',
 '264',
 '220',
 '248',
 '263',
 '220',
 '304',
 '243',
 '214',
 '211',
 '242',
 '214',
 '340',
 '235',
 '217',
 '212',
 '194',
 '249',
 '278',
 '211',
 '292',
 '233',
 '228',
 '225',
 '222',
 '214',
 '225',
 '222',
 '196',
 '205',
 '202',
 '201',
 '190',
 '233',
 '209',
 '201',
 '228',
 '212']
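
The three cells above collect each column in a separate pass, so the lists only line up if every table row contributes exactly one value to each of them. A minimal sketch of a row-wise alternative follows. It assumes each venue sits in its own `<tr>`; the sample markup and the helper name `parse_rows` are hypothetical, and only the class names `gsc_mvt_t` and `gsc_mvt_n` come from the selectors used in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: only the class names gsc_mvt_t and gsc_mvt_n
# are taken from the selectors above; the surrounding structure is assumed.
sample_html = """
<table>
  <tr><th>Publication</th><th>h5-index</th><th>h5-median</th></tr>
  <tr>
    <td class="gsc_mvt_t">Nature</td>
    <td class="gsc_mvt_n"><a href="#">444</a></td>
    <td class="gsc_mvt_n"><span>667</span></td>
  </tr>
  <tr>
    <td class="gsc_mvt_t">Science</td>
    <td class="gsc_mvt_n"><a href="#">401</a></td>
    <td class="gsc_mvt_n"><span>614</span></td>
  </tr>
</table>
"""


def parse_rows(html):
    """Return (publication, h5_index, h5_median) tuples, one per table row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        name = tr.find("td", class_="gsc_mvt_t")
        h5_index = tr.select_one("td.gsc_mvt_n > a")
        h5_median = tr.select_one("td.gsc_mvt_n > span")
        # Skip header rows or rows missing any of the three cells
        if name is None or h5_index is None or h5_median is None:
            continue
        rows.append(
            (
                name.get_text(strip=True),
                int(h5_index.get_text(strip=True)),
                int(h5_median.get_text(strip=True)),
            )
        )
    return rows


rows = parse_rows(sample_html)
print(rows)
```

Because all three values are read from the same row, a row with a missing cell is skipped as a whole rather than shifting every later value up by one position.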

In [6]:
gogl_schl = pd.DataFrame(list(zip(publication, index, median)), columns=["Publication", "h5-index", "h5-median"], index = range(1,101))
gogl_schl.rename_axis("Rank", inplace=True)
gogl_schl

Rank,Publication,h5-index,h5-median
1,Nature,444,667
2,The New England Journal of Medicine,432,780
3,Science,401,614
4,IEEE/CVF Conference on Computer Vision and Pat...,389,627
5,The Lancet,354,635
...,...,...,...
96,Journal of Business Research,145,233
97,Molecular Cancer,145,209
98,Sensors,145,201
99,Nature Climate Change,144,228


In [8]:
gogl_schl["h5-index"] = gogl_schl["h5-index"].astype(int)
gogl_schl["h5-median"] = gogl_schl["h5-median"].astype(int)
gogl_schl.dtypes

Publication    object
h5-index        int32
h5-median       int32
dtype: object
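
The cells above hard-code `range(1, 101)` for the rank and convert the numeric columns in a separate step. As a rough sketch, the same table could be built with the rank derived from the data length and an explicit check that the three lists have equal length. The helper name `build_table` is hypothetical, and the demo uses the first three rows scraped above.

```python
import pandas as pd


def build_table(publication, h5_index, h5_median):
    """Build the ranking table; ranks run from 1 to the number of rows."""
    if not (len(publication) == len(h5_index) == len(h5_median)):
        raise ValueError(
            "column lengths differ: "
            f"{len(publication)}, {len(h5_index)}, {len(h5_median)}"
        )
    df = pd.DataFrame(
        {
            "Publication": publication,
            # Scraped values arrive as strings; convert them to integers here
            "h5-index": pd.to_numeric(h5_index),
            "h5-median": pd.to_numeric(h5_median),
        }
    )
    df.index = pd.RangeIndex(1, len(df) + 1, name="Rank")
    return df


demo = build_table(
    ["Nature", "The New England Journal of Medicine", "Science"],
    ["444", "432", "401"],
    ["667", "780", "614"],
)
print(demo)
```

With the length check in place, a change in the page that yields fewer or more rows raises a clear error or simply produces a shorter table, instead of failing on a mismatched index.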

In [9]:
gogl_schl.to_excel("Top Publications.xlsx")