# [newspaper]

### Overview
- A library for scraping news articles

### Usage
- Collect multiple article pages at once, starting from a site's top page

### References
- https://github.com/codelucas/newspaper
- [Fetching and saving news articles by scraping with Python (CSV data)](https://ai-inter1.com/webscraping_newspaper_2/)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

### Import newspaper

In [29]:
# !pip3 install newspaper3k
import newspaper

# Install the required libraries
# https://github.com/codelucas/newspaper/blob/master/requirements.txt

### Collect articles from the CNN site
- After building once, you may not be able to build the same site again for some time
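
One likely cause of a later build coming back with few or no articles is newspaper's article cache: by default `build` remembers article URLs it has already seen and leaves them out on later runs. The sketch below assumes the `memoize_articles` keyword described in the newspaper README; `build_without_cache` is a hypothetical helper name, not part of the library.

```python
def build_without_cache(url):
    """Build a newspaper Source without skipping previously seen articles."""
    # Imported lazily so the helper can be defined before the package is installed
    import newspaper

    # memoize_articles=False asks newspaper not to filter out URLs cached by earlier builds
    return newspaper.build(url, memoize_articles=False)


# Example (needs network access):
# cnn_paper = build_without_cache('http://cnn.com')
# print(cnn_paper.size())
```

If the article count is still low with caching disabled, the site itself may be throttling or blocking repeated requests; that case is not handled here.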

In [40]:
cnn_paper = newspaper.build('http://cnn.com')

In [41]:
for article in cnn_paper.articles:
    print(article.url)
    print(article.title)
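
Until `download()` and `parse()` have been called, `article.title` has not been extracted from the article page and may be empty or incomplete. The following is a minimal sketch that fetches titles for only the first few articles. `first_titles` and its `limit` parameter are hypothetical names, and the function only assumes objects that expose `download()`, `parse()`, `url`, and `title`, as used elsewhere in this notebook.

```python
def first_titles(articles, limit=5):
    """Download and parse up to `limit` articles; return (url, title) pairs."""
    results = []
    for article in articles[:limit]:
        try:
            article.download()
            article.parse()
        except Exception:
            # Skip articles that cannot be fetched or parsed
            continue
        results.append((article.url, article.title))
    return results


# Example (needs network access):
# for url, title in first_titles(cnn_paper.articles, limit=3):
#     print(url, title)
```

Limiting the number of articles keeps the number of HTTP requests small while experimenting.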

In [36]:
print(cnn_paper.size())

572


In [16]:
# `article` is the last item left over from the loop over cnn_paper.articles
article.download()

In [18]:
# article.html
article.parse()

In [19]:
article.authors

[]

In [20]:
article.publish_date

datetime.datetime(2013, 12, 30, 0, 0)

In [21]:
article.text

'By Leigh Ann Caldwell\n\nWASHINGTON (CNN) — Not everyone subscribes to a New Year’s resolution, but Americans will be required to follow new laws in 2014.\n\nSome 40,000 measures taking effect range from sweeping, national mandates under Obamacare to marijuana legalization in Colorado, drone prohibition in Illinois and transgender protections in California.\n\nAlthough many new laws are controversial, they made it through legislatures, public referendum or city councils and represent the shifting composition of American beliefs.\n\nFederal: Health care, of course, and vending machines\n\nThe biggest and most politically charged change comes at the federal level with the imposition of a new fee for those adults without health insurance.\n\nFor 2014, the penalty is either $95 per adult or 1% of family income, whichever results in a larger fine.\n\nThe Obamacare, of Affordable Care Act, mandate also requires that insurers cover immunizations and some preventive care.\n\nAdditionally, mil

In [22]:
article.top_image

'http://fox13now.com/apple-touch-icon.png'

In [26]:
# !pip install --user -U nltk
import nltk
nltk.download('punkt')
article.nlp()

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/akirakawai/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [27]:
article.keywords

['obamacare',
 'national',
 'minimum',
 'laws',
 'family',
 'state',
 'drones',
 'guns',
 'leave',
 'states',
 'law',
 'latest',
 'wage',
 'pot']

In [28]:
article.summary

'Oregon: Family leave in Oregon has been expanded to allow eligible employees two weeks of paid leave to handle the death of a family member.\nArkansas: The state becomes the latest state requiring voters show a picture ID at the voting booth.\nMinimum wage and former felon employmentWorkers in 13 states and four cities will see increases to the minimum wage.\nNew Jersey residents voted to raise the state’s minimum wage by $1 to $8.25 per hour.\nCalifornia is also raising its minimum wage to $9 per hour, but workers must wait until July to see the addition.'
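
The fields inspected above can be gathered into one record per article, for example to save them as CSV. This is a minimal sketch using the standard `csv` module. `FIELDS`, `article_to_row`, and `write_articles_csv` are hypothetical names, and the code assumes each article has already been through `download()`, `parse()`, and `nlp()`.

```python
import csv

# Column order for the CSV file (an illustrative choice)
FIELDS = ["url", "title", "authors", "publish_date", "keywords", "summary"]


def article_to_row(article):
    """Flatten a parsed article into a dict of strings."""
    date = article.publish_date  # may be None when no date was found
    return {
        "url": article.url,
        "title": article.title,
        "authors": "; ".join(article.authors),
        "publish_date": date.isoformat() if date else "",
        "keywords": "; ".join(article.keywords),
        "summary": article.summary,
    }


def write_articles_csv(articles, stream):
    """Write a header and one CSV row per article to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for article in articles:
        writer.writerow(article_to_row(article))


# Example:
# with open("articles.csv", "w", newline="", encoding="utf-8") as f:
#     write_articles_csv([article], f)
```

List-valued fields such as `authors` and `keywords` are joined with `"; "` so that each article stays on a single row.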

### Find trending (hot) keywords

In [42]:
newspaper.hot()

['Calvin Ridley',
 'Oil prices',
 "International Women's Day 2022",
 'Tottenham',
 'Knicks',
 'School closings',
 'Pasha Lee',
 'San Marino',
 'Dana Blumberg',
 'Ivan Kuliak',
 'Bill Cosby',
 'Andrew Cuomo',
 'Cavs',
 'Illinois basketball',
 'Manchester United',
 'ACM Awards 2022',
 'Lil Bo Weep',
 'Outlander',
 'AEW Revolution 2022',
 'Jayson Tatum']

In [45]:
newspaper.popular_urls()[:10]

['http://www.huffingtonpost.com',
 'http://cnn.com',
 'http://www.time.com',
 'http://www.ted.com',
 'http://pandodaily.com',
 'http://www.cnbc.com',
 'http://www.mlb.com',
 'http://www.pcmag.com',
 'http://www.foxnews.com',
 'http://theatlantic.com']

### Try scraping the Nikkei newspaper

In [None]:
url = "https://www.nikkei.com/technology/ai/"

In [47]:
website = newspaper.build(url)
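
Before looping over `website.articles`, it can help to confirm that the collected URLs actually belong to the site that was requested. The sketch below uses only the standard library; `belongs_to_site` is a hypothetical helper, not part of newspaper.

```python
from urllib.parse import urlparse


def belongs_to_site(article_url, site_url):
    """Return True if article_url is on the same host as site_url, or on a subdomain of it."""
    article_host = (urlparse(article_url).hostname or "").lower()
    site_host = (urlparse(site_url).hostname or "").lower()
    if not site_host:
        return False
    # Ignore a leading "www." so www.example.com and example.com compare equal
    if site_host.startswith("www."):
        site_host = site_host[4:]
    return article_host == site_host or article_host.endswith("." + site_host)


# Example:
# on_site = [a for a in website.articles if belongs_to_site(a.url, url)]
# print(len(on_site), "of", len(website.articles), "articles are on the requested site")
```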

In [48]:
for index, website_article in enumerate(website.articles):
    try:
        website_article.download()
        website_article.parse()
        website_article.nlp()
        print("Article[" + str(index) + "]: " + website_article.url + " : " + website_article.summary + "\n")
    except Exception:
        # Download, parse, or NLP failed for this article; report it and move on
        print("Article[" + str(index) + "]: " + website_article.url + " : " + "fetch error" + "\n")

Article[0]: https://www.fox13now.com/news/local-news : Would you like to receive local news notifications on your desktop?
Yes pleaseNot now

Article[1]: https://www.fox13now.com/news/national-news : Would you like to receive local news notifications on your desktop?
Yes pleaseNot now

Article[2]: https://www.fox13now.com/news/3-questions : Would you like to receive local news notifications on your desktop?
Yes pleaseNot now

Article[3]: https://www.fox13now.com/news/booming-forward : Would you like to receive local news notifications on your desktop?
Yes pleaseNot now

Article[4]: https://www.fox13now.com/news/car-critic : Would you like to receive local news notifications on your desktop?
Yes pleaseNot now

Article[5]: https://www.fox13now.com/news/health : Would you like to receive local news notifications on your desktop?
Yes pleaseNot now

Article[6]: https://www.fox13now.com/news/politics : Would you like to receive local news notifications on your desktop?
Yes pleaseNot now

Article[7]: https://www.fox13now.com/news/p