In [1]:
from requests import get
from bs4 import BeautifulSoup

import pandas as pd
import re

import os
import json
import env
from sqlalchemy import text, create_engine

# Exercises
Goal: build acquire.py with two functions, get_blog_articles and get_news_articles.

## 1. Codeup Blog Articles
- Visit Codeup's blog and record the URLs of 5 distinct blog posts. For each post, scrape at least the post's title and content. get_blog_articles should return a list of dictionaries, with each dictionary representing one article. Each dictionary should have this shape:

{ 'title': 'the title of the article',
  'content': 'the full text content of the article'
}

Plus any additional properties you think might be useful
Bonus: scrape the text of all the articles linked on Codeup's blog page

## Steps

    1. Import the get() function from the requests module, BeautifulSoup from bs4, and pandas.
    2. Assign the address of the web page to a variable named url.
    3. Request the content of the web page from the server using get(), and store the server’s response in the variable response.
    4. Print the response text to ensure you have an html page.
    5. Take a look at the actual web page contents and inspect the source to understand the structure a bit.
    6. Use BeautifulSoup to parse the HTML into a variable ('soup').
    7. Identify the key tags you need to extract the data you are looking for.
    8. Create a dataframe of the data desired.
    9. Run some summary stats and inspect the data to ensure you have what you wanted.
    10. Edit the data structure as needed, especially so that one column has all the text you want included in this analysis.
    11. Create a corpus of the column with the text you want to analyze.
    12. Store that corpus for use in a future notebook.
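
Steps 8 through 12 can be sketched as follows. This is a minimal sketch under stated assumptions: the two sample articles are made up for illustration, and the names `articles_to_corpus` and `corpus.json` are hypothetical rather than part of the exercise.

```python
import json
import os
import tempfile

import pandas as pd


def articles_to_corpus(articles):
    """Build a DataFrame from scraped article dicts and return it with the corpus.

    `articles` is a list of dicts with 'title' and 'content' keys.
    """
    # step 8: one row per article
    df = pd.DataFrame(articles)
    # step 9: a quick sanity check on how much text each article holds
    df["word_count"] = df["content"].str.split().str.len()
    # step 10: one column with all the text to analyze
    df["text"] = df["title"] + " " + df["content"]
    # step 11: the corpus is the list of documents
    corpus = df["text"].tolist()
    return df, corpus


# made-up sample data, standing in for scraped results
sample_articles = [
    {"title": "First post", "content": "Some short text."},
    {"title": "Second post", "content": "A little more text here."},
]
df, corpus = articles_to_corpus(sample_articles)

# step 12: store the corpus for a later notebook (hypothetical file name)
corpus_path = os.path.join(tempfile.gettempdir(), "corpus.json")
with open(corpus_path, "w") as f:
    json.dump(corpus, f)
```

Keeping the corpus as a plain list of strings means it can be reloaded with json.load without needing pandas.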


In [2]:
# create variables with 5 urls from codeup's blog
url1 = 'https://codeup.com/featured/apida-heritage-month/'
url2 = 'https://codeup.com/featured/women-in-tech-panelist-spotlight/'
url3 = 'https://codeup.com/featured/women-in-tech-rachel-robbins-mayhill/'
url4 = 'https://codeup.com/codeup-news/women-in-tech-panelist-spotlight-sarah-mellor/'
url5 = 'https://codeup.com/events/women-in-tech-madeleine/'
url_list = [url1, url2, url3, url4, url5]

In [3]:
# set my user-agent (this is the one from my computer/browser)
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/113.0'}

In [4]:
# get a response
response = get(url1, headers=headers)

In [5]:
response

<Response [200]>

In [6]:
# check for a good response (one that looks like an html page's text)
response.text[:100]

'<!DOCTYPE html>\n<html lang="en-US">\n<head>\n\t<meta charset="UTF-8" />\n<meta http-equiv="X-UA-Compatib'
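
The two checks above (the status code and a peek at the start of the text) can be folded into one helper. A minimal sketch: `looks_like_html` is a hypothetical name, and the test is a rough heuristic rather than a real HTML validator.

```python
def looks_like_html(status_code, text):
    """Return True when a response has status 200 and its body starts like HTML."""
    if status_code != 200:
        return False
    # ignore leading whitespace and case when looking for the opening markup
    head = text.lstrip()[:100].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


# with a requests response this would be called as
# looks_like_html(response.status_code, response.text)
print(looks_like_html(200, '<!DOCTYPE html>\n<html lang="en-US">'))
print(looks_like_html(404, "Not Found"))
```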

In [7]:
# make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')

## Beautiful Soup Methods and Properties

    * soup.title.string gets the page's title (the same text shown in the browser tab; this is the <title> element).
    * soup.prettify() returns an indented version of the HTML, which is useful to print when you want to read the markup.
    * soup.find_all("a") finds all the anchor tags (or whatever tag is passed as the argument).
    * soup.find("h1") finds the first matching element.
    * soup.get_text() gets the text from within a matching piece of soup/HTML.
    * soup.select() takes a CSS selector as a string and returns all matching elements.
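
The methods listed above can be tried on a small page before pointing them at a real site. This is a minimal sketch: the HTML string is made up for illustration, and the variable is named `demo_soup` so it does not overwrite the notebook's `soup`.

```python
from bs4 import BeautifulSoup

# a made-up page, small enough to read the results by eye
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>First heading</h1>
    <div id="main-content"><p>Hello <a href="/one">one</a> and <a href="/two">two</a>.</p></div>
  </body>
</html>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# text of the <title> element
page_title = demo_soup.title.string
# first matching element, reduced to its text
first_h1 = demo_soup.find("h1").get_text()
# every anchor tag, reduced to its href attribute
hrefs = [a["href"] for a in demo_soup.find_all("a")]
# all the text inside one specific <div>
main_text = demo_soup.find("div", id="main-content").get_text()
# CSS selector: anchors inside the div with id main-content
selected = demo_soup.select("div#main-content a")

print(page_title, first_h1, hrefs, main_text, len(selected), sep="\n")
```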


In [8]:
soup.find('title')

<title>Spotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa</title>

In [9]:
soup.find('title').get_text()

'Spotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa'

In [10]:
# use the soup object and assign the text from the article to a variable
# after inspecting the page source, the <div> with id='main-content' turned out to hold the article text
article = soup.find('div', id='main-content')

In [11]:
article.text[:100]

'\n\n\n\n\n\nSpotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa\nMay 24,'

In [12]:
dict_el = {'title': soup.find('title').get_text(), 'content': article.text}

In [13]:
def get_title_content(url):
    """
    This function will
    - accept a string, url, which is the web address for a single codeup blog article
    - return a dictionary with two elements:
        - 'title' : the title of the blog
        - 'content' : the text content of the article
    """
    # set my user-agent (this is the one from my computer/browser)
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/113.0'}
    # get a response
    response = get(url, headers=headers)
    # make a soup variable holding the response content
    soup = BeautifulSoup(response.content, 'html.parser')
    # get the title
    title = soup.find('title').get_text()
    # use the soup object and assign the text from the article to a variable
    article = soup.find('div', id='main-content').text
    # clean up what came back by removing all the \n characters
    article = article.replace('\n', ' ')
    # drop the site footer that starts at 'Our ProgramsC'; if the marker is
    # missing, keep the whole text instead of raising an IndexError
    match = re.search(r'(.+)Our ProgramsC', article)
    article = (match.group(1) if match else article).strip()
    # assign results to a dictionary
    results_dict = {'title': title, 'content': article}
    return results_dict

In [14]:
get_title_content(url1)

{'title': 'Spotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa',
 'content': 'Spotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa May 24, 2023 | Featured   May is traditionally known as Asian American and Pacific Islander (AAPI) Heritage Month. This month we celebrate the history and contributions made possible by our AAPI friends, family, and community. We also examine our level of support and seek opportunities to better understand the AAPI community.  In an effort to address real concerns and experiences, we sat down with Arbeena Thapa, one of Codeup’s Financial Aid and Enrollment Managers. Arbeena identifies as Nepali American and Desi. Arbeena’s parents immigrated to Texas in 1988 for better employment and educational opportunities. Arbeena’s older sister was five when they made the move to the US. Arbeena was born later, becoming the first in her family to be a US citizen. At Codeup we take our efforts at inclu

In [15]:
#### trying to not grab the last part of the article
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/113.0'}
# get a response
response = get(url1, headers=headers)
# make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')
# get the title
title = soup.find('title').get_text()
# use the soup object and assign the text from the article to a variable
article = soup.find('div', id='main-content')

In [16]:
# soup.get_text()

In [17]:
test1 = article.text.strip()
test2 = test1.replace('\n', ' ')

In [18]:
# further work is needed to lop off leading whitespace and trailing stuff that isn't really part of the article
# I started to work it here, but I need to move on. Will come back. Maybe.
regexp = r'(.+)Our ProgramsC'
new = re.findall(regexp, test2)
new[0].strip()

'Spotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa May 24, 2023 | Featured   May is traditionally known as Asian American and Pacific Islander (AAPI) Heritage Month. This month we celebrate the history and contributions made possible by our AAPI friends, family, and community. We also examine our level of support and seek opportunities to better understand the AAPI community.  In an effort to address real concerns and experiences, we sat down with Arbeena Thapa, one of Codeup’s Financial Aid and Enrollment Managers. Arbeena identifies as Nepali American and Desi. Arbeena’s parents immigrated to Texas in 1988 for better employment and educational opportunities. Arbeena’s older sister was five when they made the move to the US. Arbeena was born later, becoming the first in her family to be a US citizen. At Codeup we take our efforts at inclusivity very seriously. After speaking with Arbeena, we were taught that the term AAPI excludes Desi-American ind
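
One way to finish the cleanup started above is to collapse the whitespace first and then cut at the footer marker without a regex. This is a sketch, assuming the trailing site text always begins at the 'Our ProgramsC' marker used in this notebook; `trim_article` is a hypothetical helper name.

```python
import re


def trim_article(raw_text, marker="Our ProgramsC"):
    """Collapse whitespace and drop everything from the last `marker` onward."""
    # collapse newlines, tabs, and runs of spaces into single spaces
    text = re.sub(r"\s+", " ", raw_text).strip()
    # rpartition splits at the last occurrence of the marker;
    # `found` is empty when the marker is absent, so the text is kept whole
    before, found, _ = text.rpartition(marker)
    return before.strip() if found else text


print(trim_article("\n\n Title\nBody text.  Our ProgramsCloud Admin"))
print(trim_article("No footer here"))
```

Cutting at the last occurrence mirrors the greedy `(.+)` in the regular expression above, and the fallback avoids an error on pages that lack the marker.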

In [19]:
def get_blog_articles(url_list):
    """
    This function will
    - accept a list of urls that are web addresses for codeup blog articles
    - return a list of dictionaries of the form
        {'title': title of blog, 'content': text of content of blog article}
    """
    results_list = []
    for url in url_list:
        # get dictionary of results for each url
        results_dict = get_title_content(url)
        results_list.append(results_dict)
        
    return results_list

In [20]:
results_list = get_blog_articles(url_list)

In [21]:
results_list

[{'title': 'Spotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa',
  'content': 'Spotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa May 24, 2023 | Featured   May is traditionally known as Asian American and Pacific Islander (AAPI) Heritage Month. This month we celebrate the history and contributions made possible by our AAPI friends, family, and community. We also examine our level of support and seek opportunities to better understand the AAPI community.  In an effort to address real concerns and experiences, we sat down with Arbeena Thapa, one of Codeup’s Financial Aid and Enrollment Managers. Arbeena identifies as Nepali American and Desi. Arbeena’s parents immigrated to Texas in 1988 for better employment and educational opportunities. Arbeena’s older sister was five when they made the move to the US. Arbeena was born later, becoming the first in her family to be a US citizen. At Codeup we take our efforts at inc

## 2. News Articles
We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

    Business
    Sports
    Technology
    Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}

Hints:

    Start by inspecting the website in your browser. Figure out which elements will be useful.
    Start by creating a function that handles a single article and produces a dictionary like the one above.
    Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
    Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
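
The three-function structure suggested in the hints can be sketched as below. This is a sketch under stated assumptions, not the required solution: the function names and the `fetch_html` parameter are hypothetical, the HTML is made up so the example runs offline, and the class names `news-card-title` and `news-card-content` are assumptions about the page markup that should be confirmed in the browser.

```python
from bs4 import BeautifulSoup


def parse_card(title_tag, content_tag, category):
    """Turn one title/content pair of tags into an article dictionary."""
    return {
        "title": title_tag.get_text(strip=True),
        "content": content_tag.get_text(strip=True),
        "category": category,
    }


def parse_page(html, category):
    """Find every article on one page of HTML and parse each one."""
    page = BeautifulSoup(html, "html.parser")
    titles = page.find_all("div", class_="news-card-title")
    contents = page.find_all("div", class_="news-card-content")
    return [parse_card(t, c, category) for t, c in zip(titles, contents)]


def scrape_categories(categories, fetch_html):
    """Scrape every category; `fetch_html(category)` returns that page's HTML."""
    articles = []
    for category in categories:
        articles.extend(parse_page(fetch_html(category), category))
    return articles


# made-up HTML standing in for a real page, so the sketch runs offline
FAKE_PAGE = """
<div class="news-card-title"><span>Title A</span></div>
<div class="news-card-content"><div>Content A</div></div>
<div class="news-card-title"><span>Title B</span></div>
<div class="news-card-content"><div>Content B</div></div>
"""
sample = scrape_categories(["business"], lambda category: FAKE_PAGE)
print(sample)
```

Passing the fetcher in as a parameter keeps the parsing logic testable without a network connection; in real use it could wrap a call to get() and return the response text.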


## Steps

    1. Import the get() function from the requests module, BeautifulSoup from bs4, and pandas.
    2. Assign the address of the web page to a variable named url.
    3. Request the content of the web page from the server using get(), and store the server’s response in the variable response.
    4. Print the response text to ensure you have an html page.
    5. Take a look at the actual web page contents and inspect the source to understand the structure a bit.
    6. Use BeautifulSoup to parse the HTML into a variable ('soup').
    7. Identify the key tags you need to extract the data you are looking for.
    8. Create a dataframe of the data desired.
    9. Run some summary stats and inspect the data to ensure you have what you wanted.
    10. Edit the data structure as needed, especially so that one column has all the text you want included in this analysis.
    11. Create a corpus of the column with the text you want to analyze.
    12. Store that corpus for use in a future notebook.


In [22]:
def get_news_articles(fresh=False):
    """
    This function will
    - read in news articles from a locally cached file, news_articles.json if it exists;
    - if the cached file does not exist, or if fresh=True, this function will:
        - webscrape short news articles from inshorts.com
        - business, sports, technology, entertainment sections only
        - return a list of dictionaries of the form
            {'category': type of news (sports, tech, etc.)
            ,'title': title of article
            ,'content': contents of article}
    """
    # see if cached file exists
    cache_file_name = 'news_articles.json'
    if os.path.isfile(cache_file_name) and (not fresh):
        # read the cached results, closing the file when done
        with open(cache_file_name) as cache:
            results_dict_list = json.load(cache)
        print ('cached file found and read')
        return results_dict_list
    else:
        # initialize results variable, a list of dictionaries
        results_dict_list = []
        # set user-agent
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/113.0'}
        # set url for inshorts
        url = 'https://inshorts.com/en/read/'
        # make list of sections
        news_section_list = ['business', 'sports', 'technology', 'entertainment']
        # now for each news_section, scrape from url + news_section
        for section in news_section_list:
            # get a response
            response = get(url+section, headers=headers)
            # make a soup variable holding the response content
            soup = BeautifulSoup(response.content, 'html.parser')
            # separate titles and contents out
            titles = soup.find_all('div', class_='news-card-title')
            contents = soup.find_all('div', class_='news-card-content')
            # for each element in title/contents, add it to our results
            for title, content in zip(titles, contents):
                t = title.span.text
                c = content.div.text
                results_dict_list.append({'category':section, 'title':t, 'content':c})
        # write fresh results to cached file
        with open(cache_file_name, 'w') as cache:
            json.dump(results_dict_list, cache)
        print('web scraped and written to cached file')
        
        return results_dict_list
    

In [23]:
results_dict_list = get_news_articles()
results_dict_list[:2]

cached file found and read


[{'category': 'business',
  'title': 'Sensex, Nifty end at fresh closing highs',
  'content': 'Benchmark indices Sensex and Nifty ended at record closing highs on Wednesday. Sensex ended 195 points higher at 63,523 while the Nifty ended at 18,856.85, up 40 points. The gains were led by stocks like HDFC, Reliance Industries and TCS. During the intraday trade, Sensex rose to its fresh record high level of 63,588. '},
 {'category': 'business',
  'title': "TIME releases list of the world's 100 most influential companies",
  'content': 'TIME magazine has released its annual list of the world\'s 100 most influential companies, which features OpenAI, SpaceX, Chess.com, Google DeepMind and Kim Kardashian\'s SKIMS among others. The National Payments Corporation of India (NPCI) and e-commerce platform Meesho also featured on the list. "NPCI launched UPI...which accounted for 52% of India\'s digital transactions in FY22," TIME said.'}]

In [24]:
with open('news_articles.json', 'w') as cache:
    json.dump(results_dict_list, cache)


In [25]:
with open('news_articles.json') as cache:
    temp_dict_list = json.load(cache)

In [26]:
temp_dict_list[:2]

[{'category': 'business',
  'title': 'Sensex, Nifty end at fresh closing highs',
  'content': 'Benchmark indices Sensex and Nifty ended at record closing highs on Wednesday. Sensex ended 195 points higher at 63,523 while the Nifty ended at 18,856.85, up 40 points. The gains were led by stocks like HDFC, Reliance Industries and TCS. During the intraday trade, Sensex rose to its fresh record high level of 63,588. '},
 {'category': 'business',
  'title': "TIME releases list of the world's 100 most influential companies",
  'content': 'TIME magazine has released its annual list of the world\'s 100 most influential companies, which features OpenAI, SpaceX, Chess.com, Google DeepMind and Kim Kardashian\'s SKIMS among others. The National Payments Corporation of India (NPCI) and e-commerce platform Meesho also featured on the list. "NPCI launched UPI...which accounted for 52% of India\'s digital transactions in FY22," TIME said.'}]
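
The write-then-read round trip above is the same cache logic that get_news_articles carries inline. A sketch of that pattern pulled out into a reusable helper follows; `load_or_build` and `fake_scrape` are hypothetical names, and the demo writes to a temporary directory rather than the notebook's working directory.

```python
import json
import os
import tempfile


def load_or_build(cache_path, build, fresh=False):
    """Return cached JSON data if present, otherwise call `build` and cache its result."""
    if os.path.isfile(cache_path) and not fresh:
        with open(cache_path) as cache:
            return json.load(cache)
    data = build()
    with open(cache_path, "w") as cache:
        json.dump(data, cache)
    return data


# demo with a throwaway directory and a fake scraper that counts its calls
calls = []


def fake_scrape():
    calls.append(1)
    return [{"category": "business", "title": "t", "content": "c"}]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "news_articles.json")
    first = load_or_build(path, fake_scrape)               # scrapes and writes
    second = load_or_build(path, fake_scrape)              # reads the cache
    third = load_or_build(path, fake_scrape, fresh=True)   # scrapes again

print(len(calls))
```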

In [27]:
list1 = [1,2,3]
list2 = [2,3,4]

for i, j in zip(list1, list2):
    print(i*j)

2
6
12
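
One caveat when pairing titles with contents this way: zip stops at the shorter input without raising an error, so if a page yields more titles than contents the extras are dropped silently. A sketch of a guard follows; `pair_cards` is a hypothetical helper, not part of acquire.py.

```python
def pair_cards(titles, contents):
    """Pair titles with contents, refusing to pair lists of different lengths."""
    if len(titles) != len(contents):
        raise ValueError(
            f"found {len(titles)} titles but {len(contents)} contents"
        )
    return list(zip(titles, contents))


print(pair_cards(["t1", "t2"], ["c1", "c2"]))
# plain zip silently drops the unmatched third title
print(list(zip(["t1", "t2", "t3"], ["c1", "c2"])))
```

On Python 3.10 and later, `zip(titles, contents, strict=True)` raises a ValueError for mismatched lengths and could replace the manual check.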


In [28]:
soup

<!DOCTYPE html>

<html lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<link href="https://codeup.com/xmlrpc.php" rel="pingback"/>
<script type="text/javascript">
		document.documentElement.className = 'js';
	</script>
<link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/><script id="diviarea-loader">window.DiviPopupData=window.DiviAreaConfig={"zIndex":1000000,"animateSpeed":400,"triggerClassPrefix":"show-popup-","idAttrib":"data-popup","modalIndicatorClass":"is-modal","blockingIndicatorClass":"is-blocking","defaultShowCloseButton":true,"withCloseClass":"with-close","noCloseClass":"no-close","triggerCloseClass":"close","singletonClass":"single","darkModeClass":"dark","noShadowClass":"no-shadow","altCloseClass":"close-alt","popupSelector":".et_pb_section.popup","initializeOnEvent":"et_pb_after_init_modules","popupWrapperClass":"area-outer-wrap","fullHeightClass":"full-height","openPopupClass":"da-overlay-visible","ove

In [29]:
# itemid attributes matter: they hold the links to the articles we need to follow
# every <div> whose itemid is a url also carries the itemtype used below
articles = soup.find_all('div', itemtype='http://schema.org/NewsArticle')

In [30]:
##### TEST acquire.py as import

In [31]:
import acquire as a

In [32]:
url_list

['https://codeup.com/featured/apida-heritage-month/',
 'https://codeup.com/featured/women-in-tech-panelist-spotlight/',
 'https://codeup.com/featured/women-in-tech-rachel-robbins-mayhill/',
 'https://codeup.com/codeup-news/women-in-tech-panelist-spotlight-sarah-mellor/',
 'https://codeup.com/events/women-in-tech-madeleine/']

In [33]:
codeup_blogs = a.get_blog_articles(url_list)

In [34]:
codeup_blogs

[{'title': 'Spotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa',
  'content': 'Spotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa May 24, 2023 | Featured   May is traditionally known as Asian American and Pacific Islander (AAPI) Heritage Month. This month we celebrate the history and contributions made possible by our AAPI friends, family, and community. We also examine our level of support and seek opportunities to better understand the AAPI community.  In an effort to address real concerns and experiences, we sat down with Arbeena Thapa, one of Codeup’s Financial Aid and Enrollment Managers. Arbeena identifies as Nepali American and Desi. Arbeena’s parents immigrated to Texas in 1988 for better employment and educational opportunities. Arbeena’s older sister was five when they made the move to the US. Arbeena was born later, becoming the first in her family to be a US citizen. At Codeup we take our efforts at inc

In [35]:
inshorts_articles = a.get_news_articles()

cached file found and read


In [36]:
inshorts_articles[:2]

[{'category': 'business',
  'title': 'Sensex, Nifty end at fresh closing highs',
  'content': 'Benchmark indices Sensex and Nifty ended at record closing highs on Wednesday. Sensex ended 195 points higher at 63,523 while the Nifty ended at 18,856.85, up 40 points. The gains were led by stocks like HDFC, Reliance Industries and TCS. During the intraday trade, Sensex rose to its fresh record high level of 63,588. '},
 {'category': 'business',
  'title': "TIME releases list of the world's 100 most influential companies",
  'content': 'TIME magazine has released its annual list of the world\'s 100 most influential companies, which features OpenAI, SpaceX, Chess.com, Google DeepMind and Kim Kardashian\'s SKIMS among others. The National Payments Corporation of India (NPCI) and e-commerce platform Meesho also featured on the list. "NPCI launched UPI...which accounted for 52% of India\'s digital transactions in FY22," TIME said.'}]

# WORKING ON BONUS - GET ALL CODEUP BLOGS

## Next step: get the older entries link and follow it to the next page

In [37]:
#### Working on bonus for #1 (get ALL of the codeup blogs)

In [38]:
url = 'https://codeup.com/blog/'
# set my user-agent (this is the one from my computer/browser)
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/113.0'}

In [39]:
response = get(url, headers=headers)

In [40]:
soup = BeautifulSoup(response.content, 'html.parser')

In [41]:
# this is how to get the actual links to the blog articles on each page
more_links = soup.find_all('a', class_='more-link')
more_links

[<a class="more-link" href="https://codeup.com/featured/apida-heritage-month/">read more</a>,
 <a class="more-link" href="https://codeup.com/featured/women-in-tech-panelist-spotlight/">read more</a>,
 <a class="more-link" href="https://codeup.com/featured/women-in-tech-rachel-robbins-mayhill/">read more</a>,
 <a class="more-link" href="https://codeup.com/codeup-news/women-in-tech-panelist-spotlight-sarah-mellor/">read more</a>,
 <a class="more-link" href="https://codeup.com/events/women-in-tech-madeleine/">read more</a>,
 <a class="more-link" href="https://codeup.com/codeup-news/panelist-spotlight-4/">read more</a>]

In [42]:
more_links = [link['href'] for link in more_links]
more_links

['https://codeup.com/featured/apida-heritage-month/',
 'https://codeup.com/featured/women-in-tech-panelist-spotlight/',
 'https://codeup.com/featured/women-in-tech-rachel-robbins-mayhill/',
 'https://codeup.com/codeup-news/women-in-tech-panelist-spotlight-sarah-mellor/',
 'https://codeup.com/events/women-in-tech-madeleine/',
 'https://codeup.com/codeup-news/panelist-spotlight-4/']

In [43]:
links = soup.select('a')
'H' in links[0].text

True

In [45]:
links = soup.select('a')
for link in links:
    if 'Older Entries' in link.text:
        print (link)
        older_entry = link['href']
        print(older_entry)

<a href="https://codeup.com/blog/page/2/?et_blog">« Older Entries</a>
https://codeup.com/blog/page/2/?et_blog


In [46]:
older_entry

'https://codeup.com/blog/page/2/?et_blog'

In [52]:
def get_all_codeup_blogs(start_url = 'https://codeup.com/blog/', fresh=False):
    """
    This function will
    - web scrape the title and content of all codeup blog articles
    - and save them to a locally cached json file
    - (OR read them from said json file)
    - return a list of dictionaries of the form:
        { 'title': title of the article
        , 'content': full text content of the article}
    """
    # see if cached file exists
    cache_file_name = 'codeup_blogs.json'
    if os.path.isfile(cache_file_name) and (not fresh):
        # read the cached results, closing the file when done
        with open(cache_file_name) as cache:
            results_dict_list = json.load(cache)
        print ('cached file found and read')
        return results_dict_list
    
    # else go webscrape it 
    # initialize results
    results_dict_list = []
    # set my user-agent (this is the one from my computer/browser)
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/113.0'}
    # setup for the while loop:
    next_url = start_url
    
    # as long as there is an "Older Entries" link to follow, keep following it
    while True:
        # get the response from the next page of blog titles
        response = get(next_url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        # parse soup for all the blog post links that lead to actual blogs
        more_links = soup.find_all('a', class_='more-link')
        more_links = [link['href'] for link in more_links]
        
        # for each link, go get the title and content
        for link in more_links:
            # get dictionary of results for each url
            results_dict = get_title_content(link)
            results_dict_list.append(results_dict)

        # see if there is another link for Older Entries
        all_links = soup.select('a')
        # set a flag to decide whether to continue the while loop or not
        flag = False
        # brute force: check each link to see if it contains the text 'Older Entries'
        for link in all_links:
            if 'Older Entries' in link.text:
                # if this is true, there is another link to follow; set it as next_url
                next_url = link['href']
                # and set flag to be True, then break out of this for loop
                flag = True
                break
        # if flag is true, start over at the top of the while loop and scrape the next page of blogs
        if flag:
            continue
        # if flag was false, then we're done!
        break
        
    # write fresh results to cached file
    with open(cache_file_name, 'w') as cache:
        json.dump(results_dict_list, cache)
    print('web scraped and written to cached file')

    return results_dict_list
    
    

In [53]:
codeup_blogs = a.get_all_codeup_blogs()

cached file found and read
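
The while loop in get_all_codeup_blogs ends only when a page has no "Older Entries" link. A sketch of the same follow-the-next-link idea with two safety nets (a page limit and a record of visited urls) is shown below. The names `follow_pages`, `get_page`, and `FAKE_SITE` are hypothetical, and the fake site stands in for real pages so the example runs offline.

```python
def follow_pages(start_url, get_page, max_pages=50):
    """Collect items from linked pages.

    `get_page(url)` must return (items, next_url), with next_url None on the last page.
    """
    items, seen, url = [], set(), start_url
    # stop on the last page, on a repeated url, or at the page limit
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_items, url = get_page(url)
        items.extend(page_items)
    return items


# made-up site: three pages, the last one links back to the first
FAKE_SITE = {
    "page1": (["a", "b"], "page2"),
    "page2": (["c"], "page3"),
    "page3": (["d"], "page1"),
}
print(follow_pages("page1", FAKE_SITE.get))
```

In a real scraper, `get_page` would request the url, collect the article links, and return the href of the "Older Entries" link or None when it is missing.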


# playing with my weekly woms

In [55]:
url = 'https://sites.google.com/view/weeklywom136/'
# set my user-agent (this is the one from my computer/browser)
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/113.0'}

In [65]:
# get a response
response = get(url, headers=headers)

In [66]:
response

<Response [200]>

In [67]:
# check for a good response (one that looks like an html page's text)
response.text[:100]

'<!DOCTYPE html><html lang="en-US" itemscope itemtype="http://schema.org/WebPage"><head><meta charset'

In [68]:
# make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')

In [70]:
texts = soup.find_all('span', class_='C9DxTc')

In [73]:
text_list = [t.text for t in texts]
article = ' '.join(text_list)
article

"Weekly WOM #136 18 Jun 2023 Happy  Father's Day  all, Weird how there were two Father's Days this year. :-)\xa0 Just kidding, I guess I got myself a week ahead last week. Don't worry, it's not my first mistake. My ego has recovered. \xa0\xa0 Happenings in San Antonio I brought home some pastries from  La Panaderia  on Friday. Codeup is in the same building as this restaurant, and it's a wonder I haven't gained more weight these last four months. So good. The class ahead of mine graduated on Friday which means the end is in sight. Then I'll have to finally grow up, maybe, and get a job. In the meantime, I am successfully procrastinating on that task while I look at things to do in Indianapolis. Caitlyn, Curtis, and I are taking a trip in August. Why Indianapolis? Why not?\xa0\xa0 \xa0 What I'm Reading My reading pace has slowed again because I've been wrapped up watching a show called  Bosch  on Amazon Prime. It's a detective show, set in LA. The protagonist, Harry Bosch, reminds me a 

In [74]:
import prepare as p
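
prepare.py is not shown in this notebook. For readers following along without it, here is a sketch of what a function like p.basic_clean might do. It is an assumption inferred only from the cleaned text shown in this notebook (lowercased, punctuation other than apostrophes removed, non-breaking spaces turned into plain spaces), not the author's actual code; the name `basic_clean_sketch` is hypothetical.

```python
import re
import unicodedata


def basic_clean_sketch(text):
    """Lowercase, reduce to ASCII, and keep only letters, digits, apostrophes, and whitespace."""
    text = text.lower()
    # NFKD turns non-breaking spaces into plain spaces and splits accents from letters;
    # the ascii encode/decode then discards whatever is still non-ASCII
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # drop every character that is not a lowercase letter, digit, apostrophe, or whitespace
    return re.sub(r"[^a-z0-9'\s]", "", text)


print(basic_clean_sketch("Happy  Father's Day  all, Weird how"))
```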

In [75]:
clean_article = p.basic_clean(article)
clean_article

"weekly wom 136 18 jun 2023 happy  father's day  all weird how there were two father's days this year   just kidding i guess i got myself a week ahead last week don't worry it's not my first mistake my ego has recovered    happenings in san antonio i brought home some pastries from  la panaderia  on friday codeup is in the same building as this restaurant and it's a wonder i haven't gained more weight these last four months so good the class ahead of mine graduated on friday which means the end is in sight then i'll have to finally grow up maybe and get a job in the meantime i am successfully procrastinating on that task while i look at things to do in indianapolis caitlyn curtis and i are taking a trip in august why indianapolis why not     what i'm reading my reading pace has slowed again because i've been wrapped up watching a show called  bosch  on amazon prime it's a detective show set in la the protagonist harry bosch reminds me a little bit of the pulp fiction detectives of 70 i

In [76]:
tokenized_article = p.tokenize(clean_article)
tokenized_article

"weekly wom 136 18 jun 2023 happy father ' s day all weird how there were two father ' s days this year just kidding i guess i got myself a week ahead last week don ' t worry it ' s not my first mistake my ego has recovered happenings in san antonio i brought home some pastries from la panaderia on friday codeup is in the same building as this restaurant and it ' s a wonder i haven ' t gained more weight these last four months so good the class ahead of mine graduated on friday which means the end is in sight then i ' ll have to finally grow up maybe and get a job in the meantime i am successfully procrastinating on that task while i look at things to do in indianapolis caitlyn curtis and i are taking a trip in august why indianapolis why not what i ' m reading my reading pace has slowed again because i ' ve been wrapped up watching a show called bosch on amazon prime it ' s a detective show set in la the protagonist harry bosch reminds me a little bit of the pulp fiction detectives of

In [77]:
stem_article = p.stem(tokenized_article)
stem_article

"weekli wom 136 18 jun 2023 happi father ' s day all weird how there were two father ' s day thi year just kid i guess i got myself a week ahead last week don ' t worri it ' s not my first mistak my ego ha recov happen in san antonio i brought home some pastri from la panaderia on friday codeup is in the same build as thi restaur and it ' s a wonder i haven ' t gain more weight these last four month so good the class ahead of mine graduat on friday which mean the end is in sight then i ' ll have to final grow up mayb and get a job in the meantim i am success procrastin on that task while i look at thing to do in indianapoli caitlyn curti and i are take a trip in august whi indianapoli whi not what i ' m read my read pace ha slow again becaus i ' ve been wrap up watch a show call bosch on amazon prime it ' s a detect show set in la the protagonist harri bosch remind me a littl bit of the pulp fiction detect of 70 ish year ago like phillip marlow and mike hammer differ but similar also i

In [78]:
lemma_article = p.lemmatize(tokenized_article)
lemma_article

"weekly wom 136 18 jun 2023 happy father ' s day all weird how there were two father ' s day this year just kidding i guess i got myself a week ahead last week don ' t worry it ' s not my first mistake my ego ha recovered happening in san antonio i brought home some pastry from la panaderia on friday codeup is in the same building a this restaurant and it ' s a wonder i haven ' t gained more weight these last four month so good the class ahead of mine graduated on friday which mean the end is in sight then i ' ll have to finally grow up maybe and get a job in the meantime i am successfully procrastinating on that task while i look at thing to do in indianapolis caitlyn curtis and i are taking a trip in august why indianapolis why not what i ' m reading my reading pace ha slowed again because i ' ve been wrapped up watching a show called bosch on amazon prime it ' s a detective show set in la the protagonist harry bosch reminds me a little bit of the pulp fiction detective of 70 ish yea

In [79]:
no_stop_article = p.remove_stopwords(lemma_article)
no_stop_article

Removed 207 stopwords.


"weekly wom 136 18 jun 2023 happy father ' day weird two father ' day year kidding guess got week ahead last week ' worry ' first mistake ego ha recovered happening san antonio brought home pastry la panaderia friday codeup building restaurant ' wonder ' gained weight last four month good class ahead mine graduated friday mean end sight ' finally grow maybe get job meantime successfully procrastinating task look thing indianapolis caitlyn curtis taking trip august indianapolis ' reading reading pace ha slowed ' wrapped watching show called bosch amazon prime ' detective show set la protagonist harry bosch reminds little bit pulp fiction detective 70 ish year ago like phillip marlowe mike hammer different similar also inside man netflix thinker four episode miniseries ' listening wa changing oil caitlyn ' car minute ago wa listening father ' day playlist ' couple song give eyeball workout man love zac brown band best day george strait quote week scripture tell u envision ' everyone shal