# Part 1: Web Scraping

This notebook covers some of the web scraping techniques I used to scrape articles online. It was a test notebook that helped me experiment with and understand the code. I eventually moved this code into a Python script that runs the scraping from the terminal. The key library I used was Newspaper3k, which is a great tool for scraping web articles.

To learn more, read the article I wrote about it: [click here](https://andrewhnberry.github.io/articles/2020-04/The-Easy-Way-to-Web-Scrape-Articles-Online).

In [276]:
# Import the Newspaper API that I will be using.
import newspaper
from newspaper import Article
from newspaper import Source

# Import other important plugins that I will be using. 
import pandas as pd

### How to Web Scrape Articles From a Website?

In [274]:
# Let's say we want to download articles from GameSpot.
# I set memoize_articles to False because I don't want newspaper to cache
# previously seen articles and skip them on later runs.
gamespot = newspaper.build('https://www.gamespot.com/news/', memoize_articles=False)

# Print how many article links the library identified that could be downloaded.
print(gamespot.size())

22


In [None]:
#Initiate DataFrame to store data
final_df = pd.DataFrame()

for each_article in gamespot.articles:
    
    each_article.download() # Download each article; this must be done before the data can be extracted
    each_article.parse()    # Parse the downloaded HTML
    each_article.nlp()      # Run only if you want the built-in NLP features (keywords and summary)

    # Temporary DataFrame to store the data elements of this particular article
    temp_df = pd.DataFrame(columns = ['Title', 'Authors',
                                      'Text','Summary',
                                      'published_date','Source'])
    # The data elements I am extracting here are self-explanatory.
    # Authors is a list, so the frame gets one row per author.
    temp_df['Authors'] = each_article.authors
    temp_df['Title'] = each_article.title
    temp_df['Text'] = each_article.text
    temp_df['Summary'] = each_article.summary
    temp_df['published_date'] = each_article.publish_date
    temp_df['Source'] = each_article.source_url
    
    # Append to the final DataFrame (DataFrame.append was removed in pandas 2.0)
    final_df = pd.concat([final_df, temp_df], ignore_index=True)

In [275]:
# Seems like it works
final_df.head(3)

Unnamed: 0,Title,Authors,Text,Summary,published_date,Source
0,Mario Games We Want Ported To Nintendo Switch,Alessandro Fillari,"Mario may be a platforming icon, but he's also...","Mario may be a platforming icon, but he's also...",2020-04-08 19:06:42+00:00,https://www.gamespot.com
1,8 Interesting Xbox Game Pass Games You Probabl...,James O'Connor,Xbox Game Pass subscribers are spoiled for cho...,Xbox Game Pass subscribers are spoiled for cho...,2020-04-08 20:30:00+00:00,https://www.gamespot.com/pc
2,The Best Upcoming TV Shows To Watch In 2020 (A...,Dan Auty,"In many ways, 2019 felt like a transitional pe...","In many ways, 2019 felt like a transitional pe...",2020-04-08 00:00:00,https://www.gamespot.com
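
Because `authors` is a list, assigning it to a column gives one row per author, and an article with no listed authors appears to produce no row at all. Here is a minimal sketch of an alternative that keeps one row per article by joining the author names into one string. `StubArticle`, `article_to_row`, and `articles_to_rows` are hypothetical names I made up for illustration; the stub only mimics the attributes read in the loop above so the example runs without a network connection.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class StubArticle:
    """Hypothetical stand-in exposing the attributes read in the loop above."""
    title: str
    authors: List[str] = field(default_factory=list)
    text: str = ""
    summary: str = ""
    publish_date: Optional[str] = None
    source_url: str = ""


def article_to_row(article) -> dict:
    """Flatten one parsed article into a single dict (one future DataFrame row)."""
    return {
        "Title": article.title,
        # Join the author list so a multi-author article stays on one row
        "Authors": ", ".join(article.authors),
        "Text": article.text,
        "Summary": article.summary,
        "published_date": article.publish_date,
        "Source": article.source_url,
    }


def articles_to_rows(articles: Iterable) -> List[dict]:
    """Build the full list of rows in one pass."""
    return [article_to_row(a) for a in articles]


rows = articles_to_rows([
    StubArticle(title="Example A", authors=["Ann", "Bob"]),
    StubArticle(title="Example B"),  # no authors: still produces a row
])
print(len(rows))
print(rows[0]["Authors"])
```

The resulting list can be handed to `pd.DataFrame(rows)` once at the end, which also avoids growing a DataFrame inside the loop.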


In [41]:
# Fun feature that shows the current Google search trends whenever it is executed
newspaper.hot()

['Bernie Sanders',
 'Snapchat',
 'Passover',
 "Dick's Sporting Goods",
 'Amber Heard',
 'IRS Economic stimulus checks',
 'Modern Family',
 'Wuhan',
 'Rex Manning Day',
 'Zinc',
 'Modern Warfare Season 3',
 'Happy Passover!',
 'Falcons new uniforms',
 'Alan Rickman',
 'Glenn Fine',
 'John Prine',
 'PS5 controller',
 'Valorant',
 'Kayleigh McEnany',
 'Jameis Winston']

In [42]:
# Fun feature that shows the top 100 popular URLs
newspaper.popular_urls()

['http://www.huffingtonpost.com',
 'http://cnn.com',
 'http://www.time.com',
 'http://www.ted.com',
 'http://pandodaily.com',
 'http://www.cnbc.com',
 'http://www.mlb.com',
 'http://www.pcmag.com',
 'http://www.foxnews.com',
 'http://theatlantic.com',
 'http://www.bbc.co.uk',
 'http://www.vice.com',
 'http://www.elle.com',
 'http://www.vh1.com',
 'http://espnf1.com',
 'http://espn.com',
 'http://www.npr.org',
 'http://www.sfgate.com',
 'http://www.glamour.com',
 'http://www.whosdatedwho.com',
 'http://kotaku.com',
 'http://thebostonchannel.com',
 'http://www.suntimes.com',
 'http://www.businessinsider.com',
 'http://www.rivals.com',
 'http://thebusinessjournal.com',
 'http://www.newrepublic.com',
 'http://allthingsd.com',
 'http://www.topgear.com',
 'http://thecitizen.com',
 'http://www.ign.com',
 'http://www.sci-news.com',
 'http://www.morningstar.com',
 'http://www.variety.com',
 'http://thebottomline.as.ucsb.edu',
 'http://www.gamefaqs.com',
 'http://blog.searchenginewatch.com',
 'h

### How to download articles from multiple sources quickly!

In [67]:
# This tool lets us download from several sources at once using multithreading
from newspaper import news_pool

In [268]:
#Test
gamespot = newspaper.build('https://www.gamespot.com/news/', memoize_articles=False)
bbc = newspaper.build("https://www.bbc.com/news", memoize_articles=False)

papers = [gamespot, bbc]

news_pool.set(papers, threads_per_source=4) 
news_pool.join()
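
To get a feel for what downloading in threads buys us, here is a rough standard-library sketch of the general idea. It is not how `news_pool` is implemented; it swaps in `concurrent.futures.ThreadPoolExecutor`, and `download_all` and `fake_fetch` are hypothetical names. The fake fetch function stands in for a real HTTP request so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def download_all(urls_by_source: Dict[str, List[str]],
                 fetch: Callable[[str], str],
                 threads_per_source: int = 4) -> Dict[str, List[str]]:
    """Fetch every URL concurrently and group the results by source.

    Simplification: one shared pool sized as
    threads_per_source * number of sources.
    """
    max_workers = max(1, threads_per_source * len(urls_by_source))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so all sources download at the same time
        futures = {
            source: [pool.submit(fetch, url) for url in urls]
            for source, urls in urls_by_source.items()
        }
        # Collect results in submission order for each source
        return {
            source: [f.result() for f in futs]
            for source, futs in futures.items()
        }


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a real download."""
    return f"<html>{url}</html>"


pages = download_all(
    {"gamespot": ["g/1", "g/2"], "bbc": ["b/1"]},
    fake_fetch,
)
print(pages)
```

Threads help here because downloading is mostly waiting on the network, so several requests can be in flight while each one waits for a response.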

In [272]:
#Create our final dataframe
df_articles = pd.DataFrame()

#Create a download limit per source
limit = 100

for source in papers:
    #temporary lists to store each element we want to extract
    list_title = []
    list_text = []
    list_source =[]

    count = 0

    for article_extract in source.articles:
        # Stop once the limit is reached, before parsing another article
        if count >= limit:
            break

        article_extract.parse()

        #appending the elements we want to extract
        list_title.append(article_extract.title)
        list_text.append(article_extract.text)
        list_source.append(article_extract.source_url)

        #Update count
        count +=1
        

    df_temp = pd.DataFrame({'Title': list_title, 'Text': list_text, 'Source': list_source})
    #Append to the final DataFrame (DataFrame.append was removed in pandas 2.0)
    df_articles = pd.concat([df_articles, df_temp], ignore_index=True)
    print('source extracted')

source extracted
source extracted
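
A manual counter is easy to get off by one. As an alternative, here is a small sketch that caps the number of articles with `itertools.islice` from the standard library. The helper name `take` and the placeholder strings are hypothetical; in the loop above the iterable would be `source.articles`.

```python
from itertools import islice
from typing import Iterable, List


def take(items: Iterable, limit: int) -> List:
    """Return at most `limit` items from the start of `items`."""
    return list(islice(items, limit))


# Placeholder strings standing in for Article objects
articles = [f"article-{i}" for i in range(10)]

first_four = take(articles, 4)
print(first_four)

# A limit larger than the input simply returns everything
print(len(take(articles, 100)))
```

With this helper the loop body needs no counter and no `break`: iterate over `take(source.articles, limit)` and parse each article.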


In [273]:
#Seems like a success so far!
df_articles

Unnamed: 0,Title,Text,Source
0,Mario Games We Want Ported To Nintendo Switch,"Mario may be a platforming icon, but he's also...",https://www.gamespot.com
1,8 Interesting Xbox Game Pass Games You Probabl...,Xbox Game Pass subscribers are spoiled for cho...,https://www.gamespot.com/pc
2,The Best Upcoming TV Shows To Watch In 2020 (A...,"In many ways, 2019 felt like a transitional pe...",https://www.gamespot.com
3,22 Funniest Netflix Comedy TV Shows: From Comm...,Sometimes you just have to laugh. With everyon...,https://www.gamespot.com
4,The Last of Us Part II,New The Last Of Us 2 Screenshots Drop After De...,https://www.gamespot.com/xbox-series-x
...,...,...,...
154,Special reports,Brexit coverage\n\nWhat you need to know about...,https://www.bbc.com/news
155,Long Reads,Roberto Firmino was a shy young boy. His smile...,https://www.bbc.com/news
156,Your Coronavirus Stories,Video 3 minutes 18 seconds\n\n'We are proud to...,https://www.bbc.com/news
157,BBC News,"Without a lockdown, South Korea has had great ...",https://www.bbc.com/news


### Old Code: A Similar Method to the Above, but Less Efficient

In [9]:
sources = ["https://www.theglobeandmail.com/", "https://thepostmillennial.com/politics",
           'https://www.ign.com/news', 'https://www.gamespot.com/news/', 'https://techcrunch.com/'] 

for source in sources:
    news_data = newspaper.build(source, memoize_articles=False)
    print(source)
    print(news_data.size())

https://www.theglobeandmail.com/
669
https://thepostmillennial.com/politics
73
https://www.ign.com/news
180
https://www.gamespot.com/news/
56
https://techcrunch.com/
20


In [4]:
df_articles_2 = pd.DataFrame()

#Loop over the sources and start extracting
#Create a download limit per source
limit = 4


for source in sources:
    news_data = newspaper.build(source, memoize_articles=False)
    # length_data = news_data.size()
    
    count = 0
    
    for article_extract in news_data.articles:
        #article_extract = news_data.articles[i]
        
        if count >= limit:
            break
        
        article_extract.download()
        article_extract.parse()
        article_extract.nlp()
        
        temp_df = pd.DataFrame(columns = ['Title', 'Authors',
                                      'Text','Summary',
                                      'published_date','Source'])
        
            
        temp_df['Authors'] = article_extract.authors
        temp_df['Title'] = article_extract.title
        temp_df['Text'] = article_extract.text
        temp_df['Summary'] = article_extract.summary
        temp_df['published_date'] = article_extract.publish_date
        temp_df['Source'] = article_extract.source_url
        
        df_articles_2 = pd.concat([df_articles_2, temp_df], ignore_index=True)
        
        count +=1 
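
When looping over many links one at a time, a single failed download or parse would stop the whole run. Below is a hedged sketch of one way to skip failures and keep going. `extract_safely`, `GoodStub`, and `BrokenStub` are hypothetical names; the stubs only imitate the `download()` and `parse()` methods so the example runs offline, and catching the broad `Exception` is a simplification on my part.

```python
from typing import Iterable, List, Tuple


def extract_safely(articles: Iterable,
                   steps: Tuple[str, ...] = ("download", "parse")
                   ) -> Tuple[List, List]:
    """Run each step on each article; collect failures instead of raising."""
    ok, failed = [], []
    for article in articles:
        try:
            for step in steps:
                getattr(article, step)()  # e.g. article.download()
        except Exception as exc:  # broad on purpose for this sketch
            failed.append((article, exc))
        else:
            ok.append(article)
    return ok, failed


class GoodStub:
    """Hypothetical article whose steps always succeed."""
    def __init__(self, title: str):
        self.title = title

    def download(self):
        pass

    def parse(self):
        pass


class BrokenStub(GoodStub):
    """Hypothetical article whose download always fails."""
    def download(self):
        raise ConnectionError("simulated network failure")


ok, failed = extract_safely([GoodStub("a"), BrokenStub("b"), GoodStub("c")])
print([a.title for a in ok])
print([(a.title, type(e).__name__) for a, e in failed])
```

Keeping the failed articles alongside their exceptions makes it possible to retry them later or log why they were skipped.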


In [85]:
df_articles_2

Unnamed: 0,Title,Authors,Text,Summary,published_date,Source
0,Spain coronavirus: Drive-through funerals in M...,Scott Mclean,"Madrid, Spain (CNN) Every 15 minutes or so, a ...","Madrid, Spain (CNN) Every 15 minutes or so, a ...",2020-04-06 00:00:00,https://edition.cnn.com/world
1,Spain coronavirus: Drive-through funerals in M...,Laura Perez Maestro,"Madrid, Spain (CNN) Every 15 minutes or so, a ...","Madrid, Spain (CNN) Every 15 minutes or so, a ...",2020-04-06 00:00:00,https://edition.cnn.com/world
2,They used to sell food to top chefs. Now you'r...,Hanna Ziady,London (CNN Business) Millions of restaurants ...,"The company signed up more than 30,000 custome...",2020-04-08 00:00:00,https://edition.cnn.com/business
3,They used to sell food to top chefs. Now you'r...,Cnn Business,London (CNN Business) Millions of restaurants ...,"The company signed up more than 30,000 custome...",2020-04-08 00:00:00,https://edition.cnn.com/business
4,Bernie Sanders Quits White House Race,Joshua Caplan,Sen. Bernie Sanders (I-VT) announced the suspe...,Sen. Bernie Sanders (I-VT) announced the suspe...,2020-04-08 00:00:00,https://www.breitbart.com
5,Chinese Official Claiming U.S. Army Made Coron...,Frances Martel,Chinese Foreign Ministry spokesman Zhao Lijian...,Chinese Foreign Ministry spokesman Zhao Lijian...,2020-04-08 00:00:00,https://www.breitbart.com/africa
6,Trump’s Economic Approval Rating Hits Highest ...,John Carney,The coronavirus pandemic has thrown the Americ...,Approval of President Trump’s handling of the ...,2020-04-08 00:00:00,https://www.breitbart.com
7,Fauci: We Should Start to See ‘Beginning of a ...,Trent Baker,Director of the National Institute of Allergy ...,"Fauci shared in that at this time, the White H...",2020-04-08 00:00:00,https://www.breitbart.com
8,Kamala Harris: ‘People Are Dying’ Because of T...,Pam Key,"Wednesday on ABC’s “The View,” Sen. Kamala Har...","Wednesday on ABC’s “The View,” Sen. Kamala Har...",2020-04-08 00:00:00,https://www.breitbart.com/news
9,Five Wisconsin Voters on What It Feels Like to...,Molly Olmstead,Voters at Riverside University High School in ...,Wisconsin voted Tuesday in the midst of a pand...,2020-04-08 01:20:18+00:00,https://slate.com/commenting
