<a href="https://colab.research.google.com/github/faithrts/Science_Explainers/blob/main/science_explainer_database_creation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

In [1]:
### importing libraries

# basic libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# libraries for web scraping
from bs4 import BeautifulSoup
import requests
import re
import codecs

# sklearn libraries for ML
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [2]:
### making a folder for the txt files

!mkdir txt_files

# Importing article URLs

In [3]:
### cloning git repo and saving csv file of URLs into dataframe

!git clone https://github.com/faithrts/Science_Explainers

Cloning into 'Science_Explainers'...
remote: Enumerating objects: 112, done.[K
remote: Counting objects: 100% (6/6), done.[K
remote: Compressing objects: 100% (6/6), done.[K
remote: Total 112 (delta 1), reused 0 (delta 0), pack-reused 106[K
Receiving objects: 100% (112/112), 314.92 KiB | 2.23 MiB/s, done.
Resolving deltas: 100% (4/4), done.


In [4]:
### saving csv file of URLs into dataframe

urls_df = pd.read_csv('Science_Explainers/article_urls.csv')

# replaces all NaN instances with 0
urls_df = urls_df.fillna('')

In [5]:
urls_df

Unnamed: 0,ATLANTIC,CBC,CNN,GLOBE AND MAIL,MASSIVE SCI,NATIONAL GEOGRAPHIC,NATIONAL OBSERVER,NPR,NYT,REUTERS
0,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/radio/quirks/dec-3-growling...,https://www.cnn.com/2022/11/29/world/bats-deat...,https://www.theglobeandmail.com/business/techn...,https://massivesci.com/articles/soil-runoff-re...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/03/09/ne...,https://www.npr.org/2022/12/04/1139164875/deat...,https://www.nytimes.com/2023/05/01/science/ai-...,https://www.reuters.com/technology/space/study...
1,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/radio/quirks/black-holes-je...,https://www.cnn.com/2023/05/01/world/wales-fos...,https://www.theglobeandmail.com/world/article-...,https://massivesci.com/articles/ecofriendly-cr...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/04/21/an...,https://www.npr.org/sections/health-shots/2023...,https://www.nytimes.com/2023/04/28/science/fro...,https://www.reuters.com/lifestyle/oldest-known...
2,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/radio/quirks/europe-s-juice...,https://www.cnn.com/2023/05/01/world/roman-coi...,https://www.theglobeandmail.com/canada/article...,https://massivesci.com/notes/sea-turtle-habita...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/04/11/in...,https://www.npr.org/2023/04/21/1171292778/rene...,https://www.nytimes.com/2023/04/27/science/qua...,https://www.reuters.com/lifestyle/science/toot...
3,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/radio/quirks/artificial-int...,https://www.cnn.com/2023/04/29/world/ocean-spe...,https://www.theglobeandmail.com/business/artic...,https://massivesci.com/articles/outdoor-green-...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/04/18/ne...,https://www.npr.org/2023/04/21/1170986221/cali...,https://www.nytimes.com/2023/04/25/science/gol...,https://www.reuters.com/technology/space/new-i...
4,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/radio/quirks/habitable-plan...,https://www.cnn.com/2023/04/27/asia/elephant-h...,https://www.theglobeandmail.com/canada/article...,https://massivesci.com/articles/bacterial-soil...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/03/21/an...,https://www.npr.org/2023/04/21/1171110131/gray...,https://www.nytimes.com/2023/04/27/science/mot...,https://www.reuters.com/science/ambitious-geno...
5,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/radio/quirks/koala-book-dan...,https://www.cnn.com/2023/04/24/world/aurora-no...,https://www.theglobeandmail.com/canada/article...,https://massivesci.com/articles/soil-wildfires...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/02/24/an...,https://www.npr.org/2023/04/20/1170967518/thin...,https://www.nytimes.com/2023/04/28/health/brea...,https://www.reuters.com/lifestyle/science/you-...
6,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/radio/quirks/research-earli...,https://www.cnn.com/2023/04/18/world/vikings-g...,https://www.theglobeandmail.com/canada/article...,https://massivesci.com/articles/urban-heating-...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/05/03/ne...,https://www.npr.org/2023/04/19/1170806176/abor...,https://www.nytimes.com/2023/04/27/science/hum...,https://www.reuters.com/science/good-dog-with-...
7,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/radio/quirks/feb-25-giraffe...,https://www.cnn.com/2023/04/20/world/sleep-div...,https://www.theglobeandmail.com/business/indus...,https://massivesci.com/articles/wildfire-borea...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/05/04/ne...,https://www.npr.org/2023/04/17/1169844428/this...,https://www.nytimes.com/2023/04/20/science/sal...,https://www.reuters.com/world/middle-east/dish...
8,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/radio/quirks/feb-18-super-s...,https://www.cnn.com/2023/04/20/world/worms-mun...,https://www.theglobeandmail.com/business/inter...,https://massivesci.com/articles/tree-frogs-can...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/05/03/ne...,https://www.npr.org/sections/goatsandsoda/2023...,https://www.nytimes.com/2023/04/18/health/covi...,https://www.reuters.com/world/europe/pendant-i...
9,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/news/science/edna-vacuum-ai...,https://www.cnn.com/2023/04/19/world/carnivoro...,https://www.theglobeandmail.com/arts/books/art...,https://massivesci.com/articles/chestnut-tree-...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/05/04/ne...,https://www.npr.org/sections/money/2023/05/02/...,https://www.nytimes.com/2023/04/13/climate/fla...,https://www.reuters.com/science/china-approves...


# Webscraping helper functions

In [6]:
def create_txt_file(soup, filename):

  # creates a new file
  cur_file = open('txt_files/' + filename, 'w+')

  # iterates through each passage in the article by finding <p> tags
  for passage in soup.findAll('p'):

    # extracts the text
    text = passage.get_text()

    # fixing spacing
    text = text.replace(u'\xa0', u' ')
    text = text.replace(u'  ', u' ')

    # adding a newline before the next passage
    text += '\n'

    # writing the text to the current file
    cur_file.write(text)

  cur_file.close()

In [7]:
def source_finder(url):
  # testing default source finder
  source = re.search('(?<=https:\/\/www\.)(.*?)(=?\.)', url).group(1)

  # if no source found, tests another link format
  if len(source) == 0:
    source = re.search('(?<=https:\/\/)(.*?)(=?\.)', url).group(1)

  return source

In [8]:
def title_cleaner(title):
  # removes punctuation
  title = re.sub(r'[^\w\s]', '', title)
  return ''.join(title.title().split()[:6])

In [9]:
def title_finder(soup, source):
  # testing default title finder
  title = soup.findAll('h1')

  # if the title is empty, test another title finder
  if len(title) == 0:
    # testing another title finder
    title = soup.findAll('title')

  title = title[0].get_text()
  title = title.replace('\n', '')

  # if the title is not empty, return it
  if len(title) != 0:
    return title

  # another title finder format for CNN articles
  if source == 'cnn':
    title = soup.findAll('h1', {'class': 'headline__text inline-placeholder'})[0].get_text()
    title = title.replace('  ', '')
    title = title.replace('\n', '')

    return title

In [10]:
def add_text_column(df):
  # adding a column for the text contents
  df['TEXT'] = ''

  for index, row in df.iterrows():
    filename = row['FILENAME']
    text = codecs.open('txt_files/' + filename, "r", encoding='utf8').read()

    df.at[index, 'TEXT'] = text

  return df

In [11]:
def create_database(urls_df):

  # the new database of science explainers
  database = pd.DataFrame(columns = ['FILENAME', 'TITLE', 'SOURCE', 'DATE PUBLISHED', 'URL'])

  # iterating through each column of the df, which translates to each source
  # of science explainers
  for col in urls_df.columns:

    cur_source = col

    new_folder_name = 'txt_files/' + cur_source
    new_folder_name = new_folder_name.replace(' ', '_')
    !mkdir $new_folder_name

    # iterating through the rows of the current column of the df, which
    # translates to the article urls from the current source
    for index, row in urls_df[col].items():

      cur_url = row

      # skip empty urls
      if cur_url == '':
        continue

      # gets the website content
      r = requests.get(cur_url)
      soup = BeautifulSoup(r.content, 'html.parser')

      # extracts the title of the article
      title = title_finder(soup, cur_source)

      # edit the title for the filename (title case and only the first 5 words)
      title_cut = title_cleaner(title)

      # creates a new file, writes entire article to it, then saves in the txt_files folder
      filename = cur_source.replace(' ', '_') + '/' + title_cut + '.txt'
      create_txt_file(soup, filename)

      # retrieves the date of publication
      date = all_date_finder(cur_source, soup)

      # adds a row to the science explainer database with the info of this article
      new_row = pd.DataFrame({'FILENAME': filename, 
                              'TITLE': title, 
                              'SOURCE': cur_source, 
                              'DATE PUBLISHED': date,
                              'URL': cur_url}, 
                             index = [0])
      #database = database.append(new_row, ignore_index=True)
      database = pd.concat([database, new_row], ignore_index = True)

  # returns the new database
  return database

## Date finder functions

In [12]:
def basic_date_finder(soup):
  date_bunch = soup.select_one('time')
  date = re.search('(?<= datetime=")(.*?)(=?T)', str(date_bunch)).group(1)
  return date

In [13]:
def atlantic_date_finder(soup):
  return basic_date_finder(soup)

In [14]:
def cbc_date_finder(soup):
  return basic_date_finder(soup)

In [15]:
def cnn_date_finder(soup):
  date_bunch = soup.find('link', {'rel': 'canonical'})
  date = re.search('(?<=cnn\.com\/)(\d\d\d\d\/\d\d\/\d\d)(=?)', str(date_bunch)).group(1)
  date = date.replace('/', '-')
  return date

In [16]:
def globeandmail_date_finder(soup):
  return basic_date_finder(soup)

In [17]:
def massivesci_date_finder(soup):
  return basic_date_finder(soup)

In [18]:
def nationalgeographic_date_finder(soup):
  soup_as_string = str(soup)
  date = re.search('(?<="pbDt":")(.*?)(=?T)', soup_as_string).group(1)
  return date

In [19]:
def nationalobserver_date_finder(soup):
  soup_as_string = str(soup)
  date = re.search('(?<="datePublished": ")(.*?)(=?T)', soup_as_string).group(1)
  return date

In [20]:
def npr_date_finder(soup):
  date_bunch = soup.find('link', {'rel': 'canonical'})
  date = re.search('(\d\d\d\d\/\d\d\/\d\d)', str(date_bunch)).group(1)
  date = date.replace('/', '-')
  return date

In [21]:
def reuters_date_finder(soup):
  date_bunch = soup.find('link', {'rel': 'canonical'})
  date = re.search('(\d\d\d\d-\d\d-\d\d)', str(date_bunch)).group(1)
  return date

In [22]:
def all_date_finder(source_name, soup):
  # the name of the specific source's date finder function name
  finder_func = source_name.lower().replace(' ', '') + '_date_finder'

  # calls the specific source's date finder function
  return (eval(finder_func)(soup))

In [23]:
def date_finder_tester(source_name, urls_df):
  # gets the urls from source source_name
  urls = urls_df[source_name]

  # the name of the specific source's date finder function name
  finder_func = source_name.lower().replace(' ', '') + '_date_finder'
  
  # iterates through the url list
  for url in urls:

    # skip empty urls
    if url == '':
      continue

    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')

    # prints the date found
    print(eval(finder_func)(soup))

# Testing

In [24]:
### making a copy of the df without the NYT articles, since they're behind a paywall
urls_no_nyt_df = urls_df.drop(columns = ['NYT'])

In [25]:
urls_no_nyt_df

Unnamed: 0,ATLANTIC,CBC,CNN,GLOBE AND MAIL,MASSIVE SCI,NATIONAL GEOGRAPHIC,NATIONAL OBSERVER,NPR,REUTERS
0,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/radio/quirks/dec-3-growling...,https://www.cnn.com/2022/11/29/world/bats-deat...,https://www.theglobeandmail.com/business/techn...,https://massivesci.com/articles/soil-runoff-re...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/03/09/ne...,https://www.npr.org/2022/12/04/1139164875/deat...,https://www.reuters.com/technology/space/study...
1,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/radio/quirks/black-holes-je...,https://www.cnn.com/2023/05/01/world/wales-fos...,https://www.theglobeandmail.com/world/article-...,https://massivesci.com/articles/ecofriendly-cr...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/04/21/an...,https://www.npr.org/sections/health-shots/2023...,https://www.reuters.com/lifestyle/oldest-known...
2,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/radio/quirks/europe-s-juice...,https://www.cnn.com/2023/05/01/world/roman-coi...,https://www.theglobeandmail.com/canada/article...,https://massivesci.com/notes/sea-turtle-habita...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/04/11/in...,https://www.npr.org/2023/04/21/1171292778/rene...,https://www.reuters.com/lifestyle/science/toot...
3,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/radio/quirks/artificial-int...,https://www.cnn.com/2023/04/29/world/ocean-spe...,https://www.theglobeandmail.com/business/artic...,https://massivesci.com/articles/outdoor-green-...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/04/18/ne...,https://www.npr.org/2023/04/21/1170986221/cali...,https://www.reuters.com/technology/space/new-i...
4,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/radio/quirks/habitable-plan...,https://www.cnn.com/2023/04/27/asia/elephant-h...,https://www.theglobeandmail.com/canada/article...,https://massivesci.com/articles/bacterial-soil...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/03/21/an...,https://www.npr.org/2023/04/21/1171110131/gray...,https://www.reuters.com/science/ambitious-geno...
5,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/radio/quirks/koala-book-dan...,https://www.cnn.com/2023/04/24/world/aurora-no...,https://www.theglobeandmail.com/canada/article...,https://massivesci.com/articles/soil-wildfires...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/02/24/an...,https://www.npr.org/2023/04/20/1170967518/thin...,https://www.reuters.com/lifestyle/science/you-...
6,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/radio/quirks/research-earli...,https://www.cnn.com/2023/04/18/world/vikings-g...,https://www.theglobeandmail.com/canada/article...,https://massivesci.com/articles/urban-heating-...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/05/03/ne...,https://www.npr.org/2023/04/19/1170806176/abor...,https://www.reuters.com/science/good-dog-with-...
7,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/radio/quirks/feb-25-giraffe...,https://www.cnn.com/2023/04/20/world/sleep-div...,https://www.theglobeandmail.com/business/indus...,https://massivesci.com/articles/wildfire-borea...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/05/04/ne...,https://www.npr.org/2023/04/17/1169844428/this...,https://www.reuters.com/world/middle-east/dish...
8,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/radio/quirks/feb-18-super-s...,https://www.cnn.com/2023/04/20/world/worms-mun...,https://www.theglobeandmail.com/business/inter...,https://massivesci.com/articles/tree-frogs-can...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/05/03/ne...,https://www.npr.org/sections/goatsandsoda/2023...,https://www.reuters.com/world/europe/pendant-i...
9,https://www.theatlantic.com/science/archive/20...,https://www.cbc.ca/news/science/edna-vacuum-ai...,https://www.cnn.com/2023/04/19/world/carnivoro...,https://www.theglobeandmail.com/arts/books/art...,https://massivesci.com/articles/chestnut-tree-...,https://www.nationalgeographic.com/science/art...,https://www.nationalobserver.com/2023/05/04/ne...,https://www.npr.org/sections/money/2023/05/02/...,https://www.reuters.com/science/china-approves...


In [26]:
### creating the database
create_database(urls_no_nyt_df)

Unnamed: 0,FILENAME,TITLE,SOURCE,DATE PUBLISHED,URL
0,ATLANTIC/HowToSuccessfullySmashYourFace.txt,How to Successfully Smash Your Face Against a ...,ATLANTIC,2022-07-14,https://www.theatlantic.com/science/archive/20...
1,ATLANTIC/WillCovidsSpringLullLast.txt,Will COVID’s Spring Lull Last?,ATLANTIC,2023-05-01,https://www.theatlantic.com/science/archive/20...
2,ATLANTIC/TeenBrainsArePerfectlyCapable.txt,Teen Brains Are Perfectly Capable,ATLANTIC,2023-04-30,https://www.theatlantic.com/science/archive/20...
3,ATLANTIC/TheFishHadGillsFullOf.txt,The Fish Had Gills Full of Ash and Gas Bubblin...,ATLANTIC,2023-04-29,https://www.theatlantic.com/science/archive/20...
4,ATLANTIC/LushPrairiesCouldReallyBeGreen.txt,Lush Prairies Could Really Be ‘Green Deserts’,ATLANTIC,2023-04-23,https://www.theatlantic.com/science/archive/20...
...,...,...,...,...,...
168,REUTERS/StudyExplainsHowPrimordialLifeSurvived...,Study explains how primordial life survived on...,REUTERS,2023-04-04,https://www.reuters.com/lifestyle/science/stud...
169,REUTERS/IntriguingMoonWaterSourceFoundIn.txt,Intriguing moon water source found in glass be...,REUTERS,2023-03-27,https://www.reuters.com/lifestyle/science/intr...
170,REUTERS/ScientistsExplainAlienCometOumuamuasSt...,Scientists explain alien comet 'Oumuamua's str...,REUTERS,2023-03-23,https://www.reuters.com/lifestyle/science/scie...
171,REUTERS/LocksOfHairComposeASymphony.txt,Locks of hair compose a symphony of genetic in...,REUTERS,2023-03-22,https://www.reuters.com/lifestyle/science/lock...


# Downloading txt files

In [27]:
from google.colab import files

!zip -r txt_files.zip txt_files
files.download("txt_files.zip")

  adding: txt_files/ (stored 0%)
  adding: txt_files/NATIONAL_OBSERVER/ (stored 0%)
  adding: txt_files/NATIONAL_OBSERVER/GasolineVersusElectricCarsHeresHow.txt (deflated 62%)
  adding: txt_files/NATIONAL_OBSERVER/TidepoweredCleanEnergyCouldHelpWest.txt (deflated 54%)
  adding: txt_files/NATIONAL_OBSERVER/MeatInModerationCanBePart.txt (deflated 58%)
  adding: txt_files/NATIONAL_OBSERVER/EscootersSilentMenaceOrGreenGodsend.txt (deflated 57%)
  adding: txt_files/NATIONAL_OBSERVER/ADeepDiveIntoHfcsOne.txt (deflated 62%)
  adding: txt_files/NATIONAL_OBSERVER/NewfoundlandAndLabradorsOilExplorationPlans.txt (deflated 57%)
  adding: txt_files/NATIONAL_OBSERVER/BisonOnThisPrairieFarmBring.txt (deflated 51%)
  adding: txt_files/NATIONAL_OBSERVER/APavedGreenbeltWillKneecapOntarios.txt (deflated 50%)
  adding: txt_files/NATIONAL_OBSERVER/UncheckedClimateChangePutsCanadasWest.txt (deflated 56%)
  adding: txt_files/NATIONAL_OBSERVER/MeetTheDogsSniffingStinkyMussels.txt (deflated 53%)
  adding: txt_

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>