# Data Scraping

This notebook contains the code I wrote to scrape news articles from the New York Times website.

**Notebook Contents**
- [Installs & Imports](#Installs-&-Imports)
- [Scraping](#Scraping)
- [Saving and Merging](#Saving-and-Merging)

## Installs & Imports

In [2]:
# !pip install selenium
# !pip install webdriver-manager

In [3]:
import pandas as pd
import requests
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time

## Scraping

In [8]:
# Open and initialize the web browser
driver = webdriver.Chrome(ChromeDriverManager().install())

# Save the url needed (with required filter)
# url = "https://www.nytimes.com/search?dropmab=true&endDate=20200224&query=&sort=best&startDate=20110101"

# Date range to search, in YYYYMMDD format
start_date = "20200220"
end_date = "20200228"

url = "https://www.nytimes.com/search?dropmab=true&endDate=" + end_date + "&query=&sections=World%7Cnyt%3A%2F%2Fsection%2F70e865b6-cc70-5181-84c9-8368b3a5c34b&sort=best&startDate=" + start_date + "&types=article"
# Connect to the website url
driver.get(url)

####################################################################################################################
# no_of_clicks = 5
# while no_of_clicks > 0:
#     button = driver.find_element_by_xpath('/html/body/div/div[2]/main/div/div[2]/div[2]')
#     button.click()
#     time.sleep(1.8)
#     print(f'clicks left: {no_of_clicks}')
#     no_of_clicks -= 1
####################################################################################################################

# Finding the relevant class names for scraping
dates = driver.find_elements_by_class_name('css-17ubb9w')
categories = driver.find_elements_by_class_name('css-myxawk')
titles = driver.find_elements_by_class_name('css-2fgx4k')
articles = driver.find_elements_by_class_name('css-16nhkrn')
authors = driver.find_elements_by_class_name('css-15w69y9')

# Range of the loop
loop_length = len(driver.find_elements_by_class_name('css-1l4w6pd'))
loop_minimum = loop_length - 10
print(range(loop_minimum, loop_length))

# List of entries
total = []

# Counter for the Loop
count = 0

loop_minimum = 0

while loop_length > 0:
    for i in range(loop_minimum, loop_length):
        
        # Progress report printout (to follow the loop)
        # print(f'i is :{i}')
        
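        # Required data. If one of the element lists is shorter than expected,
        # store None for that field rather than reusing the previous row's value.
        date = dates[i].text if i < len(dates) else None
        category = categories[i].text if i < len(categories) else None
        title = titles[i].text if i < len(titles) else None
        article = articles[i].text if i < len(articles) else None
        author = authors[i].text if i < len(authors) else None

        # Creating a new row of data
        new_entry = (date, category, title, article, author)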

        # Adding it to our list
        total.append(new_entry)

        # time.sleep(0.1)
        count += 1
        
####################################################################################################################        
        
        # Upon reaching the last entry available
        if (count % 10) == 0:
            
            # Click the button so that new articles appear
            button = driver.find_element_by_xpath('/html/body/div/div[2]/main/div/div[2]/div[2]')
            button.click()
            time.sleep(2.5)
            
            # Informational print outs
            # print(f'count just reached {count}')
            # print(f'is it time to click the button = {i}')
            
            # print(range(loop_length))
            # Update Loop Values
            loop_minimum = loop_length
            loop_length = len(driver.find_elements_by_class_name('css-1l4w6pd'))
            
            # New Scrapables List
            dates = driver.find_elements_by_class_name('css-17ubb9w')
            categories = driver.find_elements_by_class_name('css-myxawk')
            titles = driver.find_elements_by_class_name('css-2fgx4k')
            articles = driver.find_elements_by_class_name('css-16nhkrn')
            authors = driver.find_elements_by_class_name('css-15w69y9')
            
            print(range(loop_minimum,loop_length))
            print(f'loop continues :{i}')
            print()



Looking for [chromedriver 80.0.3987.106 mac64] driver in cache 
File found in cache by path [/Users/ataakca/.wdm/drivers/chromedriver/80.0.3987.106/mac64/chromedriver]
range(0, 10)
range(10, 20)
loop continues :9

range(20, 30)
loop continues :19



NoSuchWindowException: Message: no such window: window was already closed
  (Session info: chrome=80.0.3987.132)
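
When the run ends with an exception like the one above, the rows gathered so far are still worth keeping. One way to make that explicit is to wrap each "load more" step in a `try`/`except`, so the loop stops cleanly and returns the partial results. The following is only a minimal sketch, not the code used in this notebook: `fetch_batch` is a hypothetical callable standing in for the click-and-scrape step, and `max_batches` is an assumed safety cap.

```python
def collect_batches(fetch_batch, max_batches=100):
    """Call fetch_batch(n) repeatedly and keep whatever was gathered before a failure.

    fetch_batch is a hypothetical stand-in for one click-and-scrape step; it
    should return a list of rows, or an empty list when nothing is left.
    """
    rows = []
    for batch_number in range(max_batches):
        try:
            batch = fetch_batch(batch_number)
        except Exception as error:  # e.g. the browser window was closed
            print(f"Stopped at batch {batch_number}: {error!r}")
            break
        if not batch:
            # No new results were loaded, so there is nothing more to scrape
            break
        rows.extend(batch)
    return rows


# Usage with a fake scraper that fails on its third call
def fake_fetch(batch_number):
    if batch_number == 2:
        raise RuntimeError("window was already closed")
    return [(f"title {batch_number}-{i}",) for i in range(10)]


partial = collect_batches(fake_fetch)
print(len(partial))  # 20 rows survive the failure
```

Catching the broad `Exception` is a deliberate choice for this sketch, since the scraper can fail in several different ways and the goal is only to keep the partial results.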


***Important note:*** This code is still a work in progress. As the output above shows, it breaks, and it does so for more than three different reasons; I decided that perfecting it would not be worth the time. Instead, I used it as is to gather articles in batches (500-1000 at a time): every time it broke, I added the articles collected so far to the `merged` dataframe and restarted from the date of the last entry.
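
Restarting from the date of the last entry means rebuilding the search URL with a new date range each time. A small helper can do that from two `datetime.date` values. This is a sketch under assumptions: `build_search_url` and `WORLD_SECTION` are names introduced here for illustration, and the query parameters are simply copied from the URL in the scraping cell.

```python
from datetime import date
from urllib.parse import urlencode

# Section filter copied from the search URL used in this notebook
WORLD_SECTION = "World|nyt://section/70e865b6-cc70-5181-84c9-8368b3a5c34b"


def build_search_url(start, end):
    """Return the search URL for World articles between two datetime.date values."""
    params = {
        "dropmab": "true",
        "endDate": end.strftime("%Y%m%d"),
        "query": "",
        "sections": WORLD_SECTION,
        "sort": "best",
        "startDate": start.strftime("%Y%m%d"),
        "types": "article",
    }
    # urlencode percent-escapes the "|", ":" and "/" in the section filter
    return "https://www.nytimes.com/search?" + urlencode(params)


print(build_search_url(date(2020, 2, 20), date(2020, 2, 28)))
```

After a break, the date of the last scraped row can be passed as one of the two bounds instead of editing the strings by hand.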

## Saving and Merging

In [4]:
# For 2020 articles whose date is missing the year:
# new_df = pd.DataFrame(total, columns=['date','category','title','article','author'])
# new_df['date'] = new_df['date'] + ', 2020'
# new_df['date'] = pd.to_datetime(new_df['date'])
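
The two cases (dates with and without a year) can also be handled in one pass by appending the year only when it is missing. The helper below is a hedged sketch: `add_missing_year` is a name introduced here, and the sample strings are assumed examples of the scraped date format rather than values taken from the data.

```python
import re


def add_missing_year(date_text, default_year=2020):
    """Append default_year to a scraped date string that has no four-digit year."""
    if re.search(r"\b\d{4}\b", date_text):
        # A year is already present, so leave the string untouched
        return date_text
    return f"{date_text}, {default_year}"


print(add_missing_year("Feb. 24"))        # Feb. 24, 2020
print(add_missing_year("Dec. 31, 2018"))  # Dec. 31, 2018
```

It could be applied with `new_df['date'].apply(add_missing_year)` before calling `pd.to_datetime`.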

In [5]:
# For articles whose date already includes the year
new_df = pd.DataFrame(total, columns=['date','category','title','article','author'])
new_df['date'] = pd.to_datetime(new_df['date'])
new_df.tail()

| | date | category | title | article | author |
|---|---|---|---|---|---|
| 25 | 2018-12-31 | MIDDLE EAST | Video on Turkish TV Is Said to Show Khashoggi’... | The television network A Haber broadcast foota... | By Alan Yuhas |
| 26 | 2019-01-01 | ASIA PACIFIC | Taiwan’s Leader Urges China to Address Differe... | President Tsai Ing-wen stressed that Taiwan’s ... | By Chris Horton |
| 27 | 2019-01-01 | AMERICAS | Brazil Wanted Change. Even Before Taking Offic... | In electing Mr. Bolsonaro, a far-right politic... | By Ernesto Londoño and Manuela Andreoni |
| 28 | 2019-01-01 | ASIA PACIFIC | A Photographer’s Quest to Reverse China’s Hist... | The Chinese photographer Li Zhensheng has been... | By Amy Qin |
| 29 | 2018-12-31 | ASIA PACIFIC | Kim Jong-un, Ready to Meet Trump ‘at Any Time,... | In his New Year’s Day speech, North Korea’s le... | By Motoko Rich and David E. Sanger |


In [13]:
new_df.dtypes

date        datetime64[ns]
category            object
title               object
article             object
author              object
dtype: object

In [7]:
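# Merging the newly scraped articles into the backup data called 'merged'.
# Note: the next line resets 'merged' to an empty dataframe, so run it only for the first batch.
merged = pd.DataFrame()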
frames = [new_df, merged]
merged = pd.concat(frames)
merged.drop_duplicates(inplace=True)
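
Because the scraper was run in several batches, the merge step has to pick up the articles saved by earlier runs. One possible way to do that is to read the existing CSV, if there is one, before concatenating. This is a minimal sketch rather than the workflow actually used here: `merge_into_backup` and its `backup_path` argument are hypothetical names, and it assumes the same five columns as `new_df`.

```python
import os

import pandas as pd


def merge_into_backup(new_df, backup_path):
    """Add new_df to the articles already saved at backup_path and save the result."""
    frames = [new_df]
    if os.path.exists(backup_path):
        # Earlier batches: parse the date column so it matches new_df's dtype
        frames.append(pd.read_csv(backup_path, parse_dates=["date"]))

    merged = pd.concat(frames, ignore_index=True)
    merged = (
        merged.drop_duplicates()
        .sort_values(by="date", ascending=False)
        .reset_index(drop=True)
    )
    merged.to_csv(backup_path, index=False)
    return merged
```

Calling it after every run, for example `merged = merge_into_backup(new_df, '../Datasets/backup_data.csv')`, would keep the file on disk as the single accumulated copy.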

In [9]:
merged = merged.sort_values(by='date', ascending=False).reset_index(drop=True)

In [10]:
merged.head()

| | date | category | title | article | author |
|---|---|---|---|---|---|
| 0 | 2019-01-02 | MIDDLE EAST | Saudi Arabia Denies Issuing American Weapons t... | Sudanese soldiers fighting for the Saudi-led c... | By David D. Kirkpatrick |
| 1 | 2019-01-02 | EUROPE | Where Doulas Calm Nerves and Bridge Cultures D... | In Sweden, midwives deliver babies. But doula ... | By Christina Anderson |
| 2 | 2019-01-02 | EUROPE | Spurning Erdogan’s Vision, Turks Leave in Drov... | Driven by fear of persecution and economic mis... | By Carlotta Gall |
| 3 | 2019-01-02 | EUROPE | Freed From Forced Marriages, U.K. Women Stuck ... | Britain’s Foreign Office brought home 82 women... | By Benjamin Mueller |
| 4 | 2019-01-02 | EUROPE | Uffizi Prods Germans to Return Painting Stolen... | A German family has the artwork, and refuses t... | By Elisabetta Povoledo |


In [11]:
# Keep a backup copy of the merged dataframe so that, if I do something
# irreversible, I have more than one copy to revert to. Also, ALWAYS SAVE YOUR DATA.
# .copy() makes an independent dataframe; plain assignment would only alias 'merged'.
backup = merged.copy()
backup.to_csv('../Datasets/backup_data.csv', index=False)