# Fetching current articles from Nature with web scraping

### Importing the libraries needed for web scraping

In [1]:
# import the libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

## Scraping data from the Nature website

### Web scraping for a few articles, for now

In [2]:
page = requests.get("https://www.nature.com/")
soup = BeautifulSoup(page.content, "html.parser")
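
The request above has no timeout and does not check the HTTP status. A minimal sketch of a more defensive fetch follows. The helper name `fetch_soup`, the 10-second timeout, and the placeholder User-Agent string are assumptions and do not appear in the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10, http=requests):
    """Download a page and parse it; `http` is anything with a requests-like get()."""
    response = http.get(
        url,
        timeout=timeout,  # assumed value; avoids waiting forever on a stalled connection
        headers={"User-Agent": "nature-scraping-notebook (educational use)"},  # placeholder
    )
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return BeautifulSoup(response.content, "html.parser")


# usage (performs a real network request):
# soup = fetch_soup("https://www.nature.com/")
```

Passing the HTTP client as a parameter is only a convenience: it lets a `requests.Session` or a test double be swapped in without changing the function.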

In [3]:
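# Inspect the parsed HTML. Without parentheses this displays the repr of the bound method
# (which embeds the document); call soup.prettify() to get the indented HTML as a string.
soup.prettify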

<bound method Tag.prettify of <!DOCTYPE html>

<html class="grade-c" lang="en">
<head>
<title>Nature</title>
<link as="font" crossorigin="" href="/static/fonts/HardingText-Regular-Web-cecd90984f.woff2" rel="preload" type="font/woff2"/>
<link crossorigin="" href="https://push-content.springernature.io" rel="preconnect"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="pc,mobile" name="applicable-device"/>
<meta content="width=device-width,initial-scale=1.0,maximum-scale=5,user-scalable=yes" name="viewport"/>
<script data-test="dataLayer">
    window.dataLayer = [{"content":{"category":{"contentType":null,"legacy":null},"article":null,"attributes":{"cms":null,"deliveryPlatform":null,"copyright":null},"contentInfo":null,"journal":{"pcode":"nature","title":"Nature","volume":null,"issue":null},"authorization":{"status":true},"features":[],"collection":null},"page":{"category":{"pageType":"journal"},"attributes":{"template":"mosaic","featureFlags":[{"name":"nature-onwar

In [5]:
# Title of the featured (popular) article
soup.find("h2", class_="c-hero__title").text

'\nFlu, MERS and Ebola — the disease outbreaks most frequently reported\n'
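
The title comes back wrapped in newline characters because `.text` keeps the whitespace of the markup. A small sketch of trimming it with `get_text(strip=True)` follows. It runs on a hand-written HTML fragment that only imitates the page structure; the fragment and its shortened title are assumptions, not the real page source.

```python
from bs4 import BeautifulSoup

# hand-written fragment that mimics the hero title markup (illustrative only)
fragment = """
<h2 class="c-hero__title">
<a href="https://www.nature.com/articles/d41586-023-00196-w">Flu, MERS and Ebola</a>
</h2>
"""
soup_demo = BeautifulSoup(fragment, "html.parser")
tag = soup_demo.find("h2", class_="c-hero__title")

raw_title = tag.text                    # keeps the surrounding newlines
clean_title = tag.get_text(strip=True)  # strips whitespace around the text
```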

In [6]:
# Link of the featured article
soup.find("h2", class_="c-hero__title").a.get("href")

'https://www.nature.com/articles/d41586-023-00196-w'

In [7]:
# Article title
soup.find("h3", class_="c-card__title").text

'\nWeird supernova remnant blows scientists’ minds\n'

In [8]:
# Relative link of the article
soup.find("h3", class_="c-card__title").a.get("href")

'/articles/d41586-023-00202-1'

In [9]:
# Full link of the article
link = "https://www.nature.com" + soup.find("h3", class_="c-card__title").a.get("href")
link

'https://www.nature.com/articles/d41586-023-00202-1'
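
The featured link is already absolute while the card links are relative, so plain string concatenation only suits the second kind. As a sketch, `urllib.parse.urljoin` from the standard library handles both cases. The helper name `absolute_url` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.nature.com"


def absolute_url(href, base=BASE_URL):
    """Resolve a relative href against the site root; absolute URLs pass through unchanged."""
    return urljoin(base, href)


from_relative = absolute_url("/articles/d41586-023-00202-1")
from_absolute = absolute_url("https://www.nature.com/articles/d41586-023-00196-w")
```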

In [10]:
# Connect to the article page itself
page2 = requests.get("https://www.nature.com/articles/d41586-023-00196-w")
soup2 = BeautifulSoup(page2.content, "html.parser")

In [11]:
# Article title taken from the article page
soup2.find("h1", class_="c-article-magazine-title").text

'Flu, MERS and Ebola — the disease outbreaks most frequently reported'

In [35]:
# All article-related text on the page
soup2.find("article", class_="article-item article-item--open").text

'\n\n\n\n\nNEWS\n27 January 2023\n\nFlu, MERS and Ebola — the disease outbreaks most frequently reported\n\n\n                    The World Health Organization’s disease reports reflect public-health priorities and surveillance capabilities.\n                \n\n\n\n\n\n                Sara Reardon\n\n\n\n\nSara Reardon\n\n\nView author publications\n\nYou can also search for this author in PubMed\n\xa0Google Scholar\n\n\n\n\n\n\n\n\n\n\n\n\n\nTwitter\n\n\n\n\n\nFacebook\n\n\n\n\n\nEmail\n\n\n\n\n\n\n\n\nYou have full access to this article via your institution.\n\n\nDownload PDF\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nMiddle East respiratory syndrome, thought to be transmitted by camels, was a commonly reported infectious disease between 1996 and 2019.Credit: Fayez Nureldine/AFP via Getty\n\n\nGlobally, influenza has been responsible for more outbreaks than any other infectious disease over the past 23 years, followed by Middle East respiratory syndrome (MERS) and Ebola, finds an analysis of

In [13]:
# Main article body. NOTE: the full article text is not always available; sometimes only the abstract is present.
soup2.find("div", class_="c-article-body main-content").text

'\n\n\n\n\n\nMiddle East respiratory syndrome, thought to be transmitted by camels, was a commonly reported infectious disease between 1996 and 2019.Credit: Fayez Nureldine/AFP via Getty\n\n\nGlobally, influenza has been responsible for more outbreaks than any other infectious disease over the past 23 years, followed by Middle East respiratory syndrome (MERS) and Ebola, finds an analysis of disease reports by the World Health Organization (WHO)1. The study also reveals the subjective way in which disease outbreaks are often reported, suggesting that this can affect how resources are allocated.Public-health authorities use several data sources to track infectious-disease outbreaks, but the WHO’s Disease Outbreak News (DON) is one of the most influential. Global-health researcher Rebecca Katz at Georgetown University in Washington DC and her colleagues collected all 2,789 DON reports issued between 1996 and 2019 in a searchable database. The database includes the metadata pulled from eac

In [14]:
# Article publication date
soup2.find("time").get("datetime")

'2023-01-27'
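
The `datetime` attribute is an ISO-formatted string. A hedged sketch of turning it into a `date` object while tolerating pages without a `<time>` tag follows. `parse_published` is a hypothetical helper, and returning `None` for a missing value is an assumption about the desired behavior.

```python
from datetime import date


def parse_published(value):
    """Convert an ISO date string such as '2023-01-27' to a date; None if absent."""
    if not value:
        return None
    # keep only the YYYY-MM-DD part in case a time component follows
    return date.fromisoformat(value[:10])


published = parse_published("2023-01-27")
missing = parse_published(None)

# usage with the parsed page:
# tag = soup2.find("time")
# parse_published(tag.get("datetime") if tag else None)
```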

### Steps for collecting the other articles in a loop

In [37]:
# example article link
soup.find("h2", class_="c-hero__title").a.get("href")

'https://www.nature.com/articles/d41586-023-00196-w'

In [38]:
# define an empty list for the links
linkler = []

In [17]:
# add the featured article link to the list
linkler.append(soup.find("h2", class_="c-hero__title").a.get("href"))

In [19]:
soup.find_all("h3", class_="c-card__title")[0].a.get("href")

'/articles/d41586-023-00202-1'

In [20]:
# Store the title tags of all articles other than the featured one (they hold relative links) in the links variable
links = soup.find_all("h3", class_="c-card__title")

In [21]:
# build the full links in a loop and append them to the linkler list
for i in links:
    link = "https://www.nature.com" + i.a.get("href")
    linkler.append(link)

In [22]:
# take a look at the list
linkler

['https://www.nature.com/articles/d41586-023-00196-w',
 'https://www.nature.com/articles/d41586-023-00202-1',
 'https://www.nature.com/articles/d41586-023-00212-z',
 'https://www.nature.com/articles/s41586-022-05637-6',
 'https://www.nature.com/articles/d41586-023-00250-7',
 'https://www.nature.com/articles/d41586-023-00247-2',
 'https://www.nature.com/articles/d41586-023-00196-w',
 'https://www.nature.com/articles/d41586-023-00246-3',
 'https://www.nature.com/articles/d41586-023-00234-7',
 'https://www.nature.com/articles/d41586-023-00197-9',
 'https://www.nature.com/articles/d41586-023-00257-0',
 'https://www.nature.com/articles/d41586-023-00202-1',
 'https://www.nature.com/articles/d41586-023-00143-9',
 'https://www.nature.com/articles/d41586-023-00095-0',
 'https://www.nature.com/articles/d41586-023-00121-1',
 'https://www.nature.com/articles/d41586-023-00145-7',
 'https://www.nature.com/articles/d41586-022-04535-1',
 'https://www.nature.com/articles/d41586-023-00054-9',
 'https://
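
The list contains repeated links (for example, the featured article appears again among the cards), so the same page would be requested more than once. A minimal sketch of removing duplicates while keeping the original order, shown on a shortened copy of the list:

```python
# shortened copy of the scraped list, including one repeated link
sample_links = [
    "https://www.nature.com/articles/d41586-023-00196-w",
    "https://www.nature.com/articles/d41586-023-00202-1",
    "https://www.nature.com/articles/d41586-023-00212-z",
    "https://www.nature.com/articles/d41586-023-00196-w",  # repeat of the first link
]

# dict keys are unique and keep insertion order, so this drops repeats in place
unique_links = list(dict.fromkeys(sample_links))
```

Applied to the real list, this would be `linkler = list(dict.fromkeys(linkler))` before the download loop.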

In [23]:
# Collect each article's date, title, and text in a for loop, then build the DataFrame `data` with light cleanup
tarihler = []
basliklar = []
yazilar = []
for link in linkler:
    page3 = requests.get(link)
    soup3 = BeautifulSoup(page3.content, "html.parser")
    # title: try the magazine layout first, then the research-article layout, then any <h1>
    try:
        baslik = soup3.find("h1", class_="c-article-magazine-title").text
    except AttributeError:
        try:
            baslik = soup3.find("h1", class_="c-article-title").text
        except AttributeError:
            baslik = soup3.find("h1").text
    # text: fall back from the full article body to the teaser or summary block
    try:
        yazi = soup3.find("div", class_="c-article-body main-content").text
    except AttributeError:
        try:
            yazi = soup3.find("div", class_="c-article-body").text
        except AttributeError:
            try:
                yazi = soup3.find("p", class_="article__teaser").text
            except AttributeError:
                yazi = soup3.find("div", class_="u-mb-32").text
    tarih = soup3.find("time").get("datetime")
    tarihler.append(tarih)
    basliklar.append(baslik)
    yazilar.append(yazi)

# build the DataFrame once, after the loop, instead of rebuilding it on every iteration
data = pd.DataFrame({"Tarih": tarihler, "Başlık": basliklar, "Yazı": yazilar})
data["Yazı"] = data["Yazı"].str.replace("\n", "", regex=False)
data["Yazı"] = data["Yazı"].str.replace("\t", "", regex=False)
data["Tarih"] = pd.to_datetime(data["Tarih"])
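
The nested `try`/`except` blocks above implement a fallback chain: try one selector, and move to the next when nothing is found. The same idea can be sketched as a loop over candidates. `first_text` is a hypothetical helper and the demo HTML is made up for illustration. Note that `get_text(" ", strip=True)` normalizes whitespace slightly differently from `.text`.

```python
from bs4 import BeautifulSoup


def first_text(soup, candidates, default=None):
    """Return the text of the first (tag name, class) candidate found in the page."""
    for name, css_class in candidates:
        # only pass class_ when a class is given; otherwise match on the tag name alone
        tag = soup.find(name, class_=css_class) if css_class else soup.find(name)
        if tag is not None:
            return tag.get_text(" ", strip=True)
    return default


TITLE_CANDIDATES = [
    ("h1", "c-article-magazine-title"),
    ("h1", "c-article-title"),
    ("h1", None),
]

# made-up page that only has the second kind of title
demo_soup = BeautifulSoup('<h1 class="c-article-title">Example title</h1>', "html.parser")
title = first_text(demo_soup, TITLE_CANDIDATES)
teaser = first_text(demo_soup, [("p", "article__teaser")], default="")
```

Returning a default instead of raising keeps one unusual page layout from stopping the whole loop.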

In [24]:
# shape of data
data.shape

(27, 3)

In [25]:
data

| | Tarih | Başlık | Yazı |
|---|---|---|---|
| 0 | 2023-01-27 | Flu, MERS and Ebola — the disease outbreaks mo... | Middle East respiratory syndrome, thought to b... |
| 1 | 2023-01-26 | Weird supernova remnant blows scientists’ minds | When dying stars explode as supernovae, they u... |
| 2 | 2023-01-27 | Will a new wave of RSV vaccines stop the dange... | Respiratory syncytial virus infects the lungs ... |
| 3 | 2023-01-25 | Insulin-regulated serine and lipid metabolism ... | AbstractDiabetes represents a spectrum of dise... |
| 4 | 2023-01-26 | Daily briefing: How antidepressants help bacte... | Hello Nature readers, would you like to get th... |
| 5 | 2023-01-27 | The history of profit, and are animals creativ... | Tutankhamun and the Tomb That Changed the Worl... |
| 6 | 2023-01-27 | Flu, MERS and Ebola — the disease outbreaks mo... | Middle East respiratory syndrome, thought to b... |
| 7 | 2023-01-27 | Astrophysicists turn fast radio bursts into co... | The Deep Synoptic Array in California's Owens ... |
| 8 | 2023-01-27 | Should COVID vaccines be given yearly? Proposa... | A health-care worker prepares a booster dose o... |
| 9 | 2023-01-27 | CRISPR voles can’t detect ‘love hormone’ oxyto... | Prairie voles (Microtus ochrogaster) are known... |


In [26]:
# Drop repeated values in the Başlık column so the same article is not kept more than once
data.drop_duplicates(subset="Başlık", inplace = True)

In [27]:
# new shape of data
data.shape

(25, 3)

In [28]:
data.loc[20]

Tarih                                   2023-01-24 00:00:00
Başlık    From the archive: calculating the duration of ...
Yazı                                          100 years ago
Name: 20, dtype: object

In [29]:
data.Yazı[20]

'100 years ago'

In [30]:
data.Başlık[20]

'From the archive: calculating the duration of a dream, and tracking twinkling'

In [31]:
data.drop(20, inplace=True)  # drop the unneeded row at index label 20
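
Dropping the row by its index label works for this particular run, but the label of such a row will change whenever the front page changes. One possible alternative, sketched under the assumption that very short texts (like the `'100 years ago'` teaser) are the ones to discard, is to filter by text length. The helper `clean_articles` and the `min_chars=50` threshold are hypothetical.

```python
import pandas as pd


def clean_articles(df, title_col="Başlık", text_col="Yazı", min_chars=50):
    """Drop repeated titles and rows whose text is shorter than min_chars."""
    cleaned = df.drop_duplicates(subset=title_col)
    cleaned = cleaned[cleaned[text_col].str.len() >= min_chars]
    return cleaned.reset_index(drop=True)


# tiny made-up frame: one duplicate title and one teaser-only row
demo_df = pd.DataFrame({
    "Başlık": ["A", "A", "From the archive"],
    "Yazı": ["x" * 80, "x" * 80, "100 years ago"],
})
result = clean_articles(demo_df)
```

The function returns a new DataFrame instead of modifying its input, so the raw scrape stays available for inspection.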

In [32]:
# Reset the index numbers after dropping the row
data.reset_index(drop=True, inplace=True)
data

| | Tarih | Başlık | Yazı |
|---|---|---|---|
| 0 | 2023-01-27 | Flu, MERS and Ebola — the disease outbreaks mo... | Middle East respiratory syndrome, thought to b... |
| 1 | 2023-01-26 | Weird supernova remnant blows scientists’ minds | When dying stars explode as supernovae, they u... |
| 2 | 2023-01-27 | Will a new wave of RSV vaccines stop the dange... | Respiratory syncytial virus infects the lungs ... |
| 3 | 2023-01-25 | Insulin-regulated serine and lipid metabolism ... | AbstractDiabetes represents a spectrum of dise... |
| 4 | 2023-01-26 | Daily briefing: How antidepressants help bacte... | Hello Nature readers, would you like to get th... |
| 5 | 2023-01-27 | The history of profit, and are animals creativ... | Tutankhamun and the Tomb That Changed the Worl... |
| 6 | 2023-01-27 | Astrophysicists turn fast radio bursts into co... | The Deep Synoptic Array in California's Owens ... |
| 7 | 2023-01-27 | Should COVID vaccines be given yearly? Proposa... | A health-care worker prepares a booster dose o... |
| 8 | 2023-01-27 | CRISPR voles can’t detect ‘love hormone’ oxyto... | Prairie voles (Microtus ochrogaster) are known... |
| 9 | 2023-01-27 | Stricter US guidelines for ‘gain-of-function’ ... | Some scientists would like more rigorous overs... |


In [33]:
# example article row (the title is in Başlık)
data.loc[1]

Tarih                                   2023-01-26 00:00:00
Başlık      Weird supernova remnant blows scientists’ minds
Yazı      When dying stars explode as supernovae, they u...
Name: 1, dtype: object

In [34]:
# example article text
data.Yazı[1]

'When dying stars explode as supernovae, they usually eject a chaotic web of dust and gas. But a new image of a supernova’s remains looks completely different — as though its central star sparked a cosmic fireworks display. It is the most unusual remnant that researchers have ever found, and could point to a rare type of supernova that astronomers have long struggled to explain.“I have worked on supernova remnants for 30 years, and I’ve never seen anything like this,” says Robert Fesen, an astronomer at Dartmouth College in Hanover, New Hampshire, who imaged the remnant late last year. He reported his findings at a meeting of the American Astronomical Society on 12 January and posted them in a not-yet-peer-reviewed paper on the same day1.An 850-year-old fireworkIn 2013, amateur astronomer Dana Patchick discovered the object in archived images from NASA’s Wide-field Infrared Survey Explorer. Over the next decade, several teams studied the remnant, known as Pa 30, but the results became 

In [41]:
# save the DataFrame we built as .csv and .xlsx files
data.to_csv("nature.csv", index = False)
data.to_excel("nature.xlsx", sheet_name = "nature", index = False)
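
The fixed file names above are overwritten on every run. Since the front page changes from day to day, one option is to stamp the output name with the scrape date. A small sketch with a hypothetical helper `dated_name` follows. The day-month-year pattern mirrors the `nature_28-01-2023.csv` file visible in the directory listing and is otherwise an assumption.

```python
from datetime import date


def dated_name(prefix, extension, day=None):
    """Build a file name such as 'nature_28-01-2023.csv' for the given (or current) day."""
    day = day or date.today()
    return f"{prefix}_{day.strftime('%d-%m-%Y')}.{extension}"


csv_name = dated_name("nature", "csv", date(2023, 1, 28))

# usage with the DataFrame built earlier:
# data.to_csv(dated_name("nature", "csv"), index=False)
```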

In [39]:
ls

 Volume in drive C has no label.
 Volume Serial Number is 6A48-6A99

 Directory of C:\Users\azzrd\Documents\AI\AA - Excercise\nature_dergisi

01/29/2023  02:41 PM    <DIR>          .
01/29/2023  02:41 PM    <DIR>          ..
01/29/2023  01:03 PM    <DIR>          .ipynb_checkpoints
01/28/2023  05:37 AM           513,738 nature_28-01-2023.csv
01/28/2023  07:31 AM           257,252 nature_web_scraping.ipynb
01/29/2023  02:41 PM           225,682 nature_web_scraping_new.ipynb
01/28/2023  07:31 AM         1,016,475 translated_nature.csv
01/29/2023  12:02 AM            20,652 Untitled.ipynb
               5 File(s)      2,033,799 bytes
               3 Dir(s)  98,038,894,592 bytes free
