# Data Acquisition 

## Imports

In [1]:
from requests import get
from bs4 import BeautifulSoup
import os
import pandas as pd
import re

## Codeup Blog Articles

Visit Codeup's blog and record the URLs for at least 5 distinct blog posts. For each post, scrape at least the post's title and content.

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

```python
{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
```

Plus any additional properties you think might be helpful.



In [2]:
url = 'https://codeup.com/blog/'
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)

In [3]:
soup = BeautifulSoup(response.content, 'html.parser')

### find_all is one way to get access to each article link

In [4]:
soup.find_all('a')

[<a href="https://codeup.com/">Home</a>,
 <a href="https://codeup.com/program/cloud-adminsitration/">Cloud Administration</a>,
 <a href="https://codeup.com/program/full-stack-web-development/">Full Stack Web Development</a>,
 <a href="https://codeup.com/program/data-science/">Data Science</a>,
 <a href="/events/">Workshops</a>,
 <a href="/san-antonio-events/">San Antonio</a>,
 <a href="/dallas-events/">Dallas</a>,
 <a href="https://codeup.com/financial-aid/">Financial Aid</a>,
 <a href="https://codeup.com/veterans/">Military</a>,
 <a href="https://codeup.com/hire-tech-talent/">Hire Tech Talent</a>,
 <a href="https://alumni.codeup.com/">Alumni</a>,
 <a href="https://codeup.com/resources/">Resources</a>,
 <a href="/my-story/">Student Reviews</a>,
 <a aria-current="page" href="https://codeup.com/blog/">Blog</a>,
 <a href="https://codeup.com/frequently-asked-questions/">Common Questions</a>,
 <a href="https://codeup.com/podcast/">Hire Tech Podcast</a>,
 <a href="https://codeup.com/apply-no

In [5]:
soup.find_all('a', class_='more-link')

[<a class="more-link" href="https://codeup.com/featured/women-in-tech-panelist-spotlight/">read more</a>,
 <a class="more-link" href="https://codeup.com/featured/women-in-tech-rachel-robbins-mayhill/">read more</a>,
 <a class="more-link" href="https://codeup.com/codeup-news/women-in-tech-panelist-spotlight-sarah-mellor/">read more</a>,
 <a class="more-link" href="https://codeup.com/events/women-in-tech-madeleine/">read more</a>,
 <a class="more-link" href="https://codeup.com/codeup-news/panelist-spotlight-4/">read more</a>,
 <a class="more-link" href="https://codeup.com/events/black-excellence-in-tech-panelist-spotlight-stephanie-jones/">read more</a>]

In [6]:
soup.find_all('a', class_='more-link')[0]

<a class="more-link" href="https://codeup.com/featured/women-in-tech-panelist-spotlight/">read more</a>

In [7]:
soup.find_all('a', class_='more-link')[0]['href']

'https://codeup.com/featured/women-in-tech-panelist-spotlight/'

### select is another way to get access to each article link

In [8]:
soup.select('.more-link')

[<a class="more-link" href="https://codeup.com/featured/women-in-tech-panelist-spotlight/">read more</a>,
 <a class="more-link" href="https://codeup.com/featured/women-in-tech-rachel-robbins-mayhill/">read more</a>,
 <a class="more-link" href="https://codeup.com/codeup-news/women-in-tech-panelist-spotlight-sarah-mellor/">read more</a>,
 <a class="more-link" href="https://codeup.com/events/women-in-tech-madeleine/">read more</a>,
 <a class="more-link" href="https://codeup.com/codeup-news/panelist-spotlight-4/">read more</a>,
 <a class="more-link" href="https://codeup.com/events/black-excellence-in-tech-panelist-spotlight-stephanie-jones/">read more</a>]

In [9]:
soup.select('.more-link')[0]

<a class="more-link" href="https://codeup.com/featured/women-in-tech-panelist-spotlight/">read more</a>

In [10]:
soup.select('.more-link')[0]['href']

'https://codeup.com/featured/women-in-tech-panelist-spotlight/'
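
Both approaches return the same tags here. A minimal sketch on a hand-written HTML snippet shows the equivalence; the markup and `example.com` URLs are made up for illustration and are not copied from the Codeup page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure of the blog listing page.
html = """
<div>
  <a href="https://example.com/">Home</a>
  <a class="more-link" href="https://example.com/post-1/">read more</a>
  <a class="more-link" href="https://example.com/post-2/">read more</a>
</div>
"""

demo_soup = BeautifulSoup(html, 'html.parser')

# find_all filters by tag name and attribute keyword arguments.
via_find_all = [a['href'] for a in demo_soup.find_all('a', class_='more-link')]

# select takes a CSS selector; 'a.more-link' also restricts the match to <a> tags.
via_select = [a['href'] for a in demo_soup.select('a.more-link')]

print(via_find_all == via_select)
```

Which one to use is mostly a matter of taste: `select` is compact when the target is easy to describe as a CSS selector, while `find_all` reads well when filtering on a tag name plus one attribute.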

### List comprehension review

In [11]:
[n for n in range(1, 11)]

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

### Using list comprehension to get all the links out

In [12]:
links = [link['href'] for link in soup.select('.more-link')]
links

['https://codeup.com/featured/women-in-tech-panelist-spotlight/',
 'https://codeup.com/featured/women-in-tech-rachel-robbins-mayhill/',
 'https://codeup.com/codeup-news/women-in-tech-panelist-spotlight-sarah-mellor/',
 'https://codeup.com/events/women-in-tech-madeleine/',
 'https://codeup.com/codeup-news/panelist-spotlight-4/',
 'https://codeup.com/events/black-excellence-in-tech-panelist-spotlight-stephanie-jones/']
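
The `more-link` hrefs above are already absolute, but some navigation links on the same page (for example `/events/`) are relative. If a selector ever returns relative hrefs, they can be resolved against the page URL with the standard library's `urljoin`. A small sketch, where `base_url` and the sample `hrefs` list are names chosen here for illustration:

```python
from urllib.parse import urljoin

base_url = 'https://codeup.com/blog/'

# Sample hrefs: one relative (as seen in the navigation links) and one absolute.
hrefs = ['/events/', 'https://codeup.com/featured/women-in-tech-panelist-spotlight/']

# urljoin leaves absolute URLs unchanged and resolves relative ones against base_url.
absolute_links = [urljoin(base_url, href) for href in hrefs]
print(absolute_links)
```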

### Get title and content from article

In [13]:
url = links[0]
response = get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

In [14]:
soup.find('h1', class_='entry-title').text

'Women in tech: Panelist Spotlight – Magdalena Rahn'

In [15]:
soup.find('div', class_='entry-content')

<div class="entry-content">
<h2><strong>Women in tech: Panelist Spotlight – Magdalena Rahn</strong></h2>
<p>Codeup is hosting a <a href="https://www.youtube.com/watch?v=elX8Fm9UIX0" rel="noopener" target="_blank">Women in Tech Panel</a> in honor of Women’s History Month on March 29th, 2023! To further celebrate, we’d like to spotlight each of our panelists leading up to the discussion to learn a bit about their respective experiences as women in the tech industry!</p>
<p><img alt="women history spotlight" class="wp-image-19779 alignleft" decoding="async" height="196" sizes="(max-width: 195px) 100vw, 195px" src="https://tribucodeup.wpenginepowered.com/wp-content/uploads/2023/03/382AD943-500C-4B7E-96C0-35009015B02F-298x300.jpeg" srcset="https://tribucodeup.wpenginepowered.com/wp-content/uploads/2023/03/382AD943-500C-4B7E-96C0-35009015B02F-298x300.jpeg 298w, https://tribucodeup.wpenginepowered.com/wp-content/uploads/2023/03/382AD943-500C-4B7E-96C0-35009015B02F-150x150.jpeg 150w, https://t

In [16]:
soup.find('div', class_='entry-content').text.strip()

'Women in tech: Panelist Spotlight – Magdalena Rahn\nCodeup is hosting a Women in Tech Panel in honor of Women’s History Month on March 29th, 2023! To further celebrate, we’d like to spotlight each of our panelists leading up to the discussion to learn a bit about their respective experiences as women in the tech industry!\n\nMeet Magdalena!\nMagdalena Rahn is a current Codeup student in a Data Science cohort in San Antonio, Texas. She has a professional background in cross-cultural communications, international business development, the wine industry and journalism. After serving in the US Navy, she decided to complement her professional skill set by attending the Data Science program at Codeup; she is set to graduate in March 2023. Magdalena is fluent in French, Bulgarian, Chinese-Mandarin, Spanish and Italian.\nWe asked Magdalena how Codeup impacted her career, and she replied “Codeup has provided a solid foundation in analytical processes, programming and data science methods, and 

### Get rid of the title in the content

In [17]:
title = soup.find('h1', class_='entry-title').text

In [18]:
content = soup.find('div', class_='entry-content').text.strip()

In [19]:
content = content.replace(title, '')
content

'\nCodeup is hosting a Women in Tech Panel in honor of Women’s History Month on March 29th, 2023! To further celebrate, we’d like to spotlight each of our panelists leading up to the discussion to learn a bit about their respective experiences as women in the tech industry!\n\nMeet Magdalena!\nMagdalena Rahn is a current Codeup student in a Data Science cohort in San Antonio, Texas. She has a professional background in cross-cultural communications, international business development, the wine industry and journalism. After serving in the US Navy, she decided to complement her professional skill set by attending the Data Science program at Codeup; she is set to graduate in March 2023. Magdalena is fluent in French, Bulgarian, Chinese-Mandarin, Spanish and Italian.\nWe asked Magdalena how Codeup impacted her career, and she replied “Codeup has provided a solid foundation in analytical processes, programming and data science methods, and it’s been an encouragement to have such supportive

### Put it together

In [20]:
url = 'https://codeup.com/blog/'
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)

soup = BeautifulSoup(response.content, 'html.parser')

links = [link['href'] for link in soup.select('.more-link')]

articles = []

for url in links:
    
    url_response = get(url, headers=headers)
    soup = BeautifulSoup(url_response.text, 'html.parser')
    
    title = soup.find('h1', class_='entry-title').text
    content = soup.find('div', class_='entry-content').text.strip()
    content = content.replace(title, '')

    
    article_dict = {
        'title': title,
        'content': content
    }
    
    articles.append(article_dict)

In [21]:
articles[0:5]

[{'title': 'Women in tech: Panelist Spotlight – Magdalena Rahn',
  'content': '\nCodeup is hosting a Women in Tech Panel in honor of Women’s History Month on March 29th, 2023! To further celebrate, we’d like to spotlight each of our panelists leading up to the discussion to learn a bit about their respective experiences as women in the tech industry!\n\nMeet Magdalena!\nMagdalena Rahn is a current Codeup student in a Data Science cohort in San Antonio, Texas. She has a professional background in cross-cultural communications, international business development, the wine industry and journalism. After serving in the US Navy, she decided to complement her professional skill set by attending the Data Science program at Codeup; she is set to graduate in March 2023. Magdalena is fluent in French, Bulgarian, Chinese-Mandarin, Spanish and Italian.\nWe asked Magdalena how Codeup impacted her career, and she replied “Codeup has provided a solid foundation in analytical processes, programming an

### We still have titles in the content due to mismatched capitalization!

In [22]:
title = soup.find('h1', class_='entry-title').text

In [23]:
content = soup.find('div', class_='entry-content').text.strip()

In [24]:
content

'Black excellence in tech: Panelist Spotlight – Stephanie Jones\nCodeup is hosting our second Black Excellence in Tech Panel in honor of Black History Month on February 22, 2023! To further celebrate, we’d like to spotlight each of our panelists leading up to the discussion to learn a bit about their respective experiences as black leaders in the tech industry!\xa0\xa0\nMeet Stephanie!\n\nStephanie Jones is an Alumna of Codeup’s Data Science Program (March 2022) and currently works as a Business Intelligence Developer for Victory Capital, a global investment management firm based in San Antonio, TX.\xa0\nStephanie is passionate about visual storytelling and, as a late “tech bloomer,” is also an advocate for equitable access and educational opportunities in technology for underserved communities.\nWe asked Stephanie to share more about her experience at Codeup. She shares, “I have always been a creative and Codeup opened up a world of possibilities by exposing me to advanced technical c

In [25]:
compiled = re.compile(re.escape(title), re.IGNORECASE)
content = compiled.sub('', content)
content

'\nCodeup is hosting our second Black Excellence in Tech Panel in honor of Black History Month on February 22, 2023! To further celebrate, we’d like to spotlight each of our panelists leading up to the discussion to learn a bit about their respective experiences as black leaders in the tech industry!\xa0\xa0\nMeet Stephanie!\n\nStephanie Jones is an Alumna of Codeup’s Data Science Program (March 2022) and currently works as a Business Intelligence Developer for Victory Capital, a global investment management firm based in San Antonio, TX.\xa0\nStephanie is passionate about visual storytelling and, as a late “tech bloomer,” is also an advocate for equitable access and educational opportunities in technology for underserved communities.\nWe asked Stephanie to share more about her experience at Codeup. She shares, “I have always been a creative and Codeup opened up a world of possibilities by exposing me to advanced technical concepts and career opportunities.” Stephanie also speaks to th
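
`re.escape` matters here because a title can contain characters that are special in regular expressions. The sketch below uses a made-up title and body to show the case-insensitive removal. Passing `count=1` is an optional tweak, not used in the cell above, that removes only the first occurrence in case the title text reappears later in the article.

```python
import re

# Made-up example: the heading inside the body is capitalized differently from the title.
title = 'Is C++ worth learning? (2023)'
content = ('Is C++ Worth Learning? (2023)\n'
           'Short answer: is c++ worth learning? (2023) is a common question.')

# str.replace is case-sensitive, so it leaves the mismatched heading in place.
unchanged = content.replace(title, '') == content
print(unchanged)

# re.escape neutralizes '+', '?', '(' and ')' so the title is matched literally.
pattern = re.compile(re.escape(title), re.IGNORECASE)

# count=1 removes only the leading copy of the title.
cleaned = pattern.sub('', content, count=1).strip()
print(cleaned)
```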

In [26]:
url = 'https://codeup.com/blog/'
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)

soup = BeautifulSoup(response.content, 'html.parser')

links = [link['href'] for link in soup.select('.more-link')]

articles = []

for url in links:
    
    url_response = get(url, headers=headers)
    soup = BeautifulSoup(url_response.text, 'html.parser')
    
    title = soup.find('h1', class_='entry-title').text
    content = soup.find('div', class_='entry-content').text.strip()
    compiled = re.compile(re.escape(title), re.IGNORECASE)
    content = compiled.sub('', content)
    

    
    article_dict = {
        'title': title,
        'content': content
    }
    
    articles.append(article_dict)

In [27]:
articles[0:5]

[{'title': 'Women in tech: Panelist Spotlight – Magdalena Rahn',
  'content': '\nCodeup is hosting a Women in Tech Panel in honor of Women’s History Month on March 29th, 2023! To further celebrate, we’d like to spotlight each of our panelists leading up to the discussion to learn a bit about their respective experiences as women in the tech industry!\n\nMeet Magdalena!\nMagdalena Rahn is a current Codeup student in a Data Science cohort in San Antonio, Texas. She has a professional background in cross-cultural communications, international business development, the wine industry and journalism. After serving in the US Navy, she decided to complement her professional skill set by attending the Data Science program at Codeup; she is set to graduate in March 2023. Magdalena is fluent in French, Bulgarian, Chinese-Mandarin, Spanish and Italian.\nWe asked Magdalena how Codeup impacted her career, and she replied “Codeup has provided a solid foundation in analytical processes, programming an

### Put the articles in a DataFrame

In [28]:
blog_article_df = pd.DataFrame(articles)
blog_article_df

|   | title | content |
|---|---|---|
| 0 | Women in tech: Panelist Spotlight – Magdalena ... | \nCodeup is hosting a Women in Tech Panel in h... |
| 1 | Women in tech: Panelist Spotlight – Rachel Rob... | \nCodeup is hosting a Women in Tech Panel in h... |
| 2 | Women in Tech: Panelist Spotlight – Sarah Mellor | \nCodeup is hosting a Women in Tech Panel in ... |
| 3 | Women in Tech: Panelist Spotlight – Madeleine ... | \nCodeup is hosting a Women in Tech Panel in h... |
| 4 | Black Excellence in Tech: Panelist Spotlight –... | \n\nCodeup is hosting a Black Excellence in Te... |
| 5 | Black excellence in tech: Panelist Spotlight –... | \nCodeup is hosting our second Black Excellenc... |


In [29]:
blog_article_df.to_csv('blog_articles.csv', index=False)
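
The exercise asks for this work to be wrapped in a function named `get_blog_articles`. One possible sketch follows; it is not necessarily the intended solution. The helpers `parse_blog_article` and `fetch_html`, the `fetch` parameter, and the extra `.strip()` after removing the title are assumptions introduced here for illustration.

```python
import re

from bs4 import BeautifulSoup
from requests import get

HEADERS = {'User-Agent': 'Codeup Data Science'}


def parse_blog_article(html):
    """Extract the title and content from one article page's HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('h1', class_='entry-title').text
    content = soup.find('div', class_='entry-content').text.strip()
    # Drop the repeated title from the content, ignoring capitalization,
    # then strip the newline left behind (an addition to the cells above).
    content = re.compile(re.escape(title), re.IGNORECASE).sub('', content).strip()
    return {'title': title, 'content': content}


def fetch_html(url):
    """Download a page and return its HTML text (requires network access)."""
    return get(url, headers=HEADERS).text


def get_blog_articles(urls, fetch=fetch_html):
    """Return a list of {'title', 'content'} dictionaries, one per URL.

    `fetch` is a hypothetical hook so the parsing can be tried on saved HTML.
    """
    return [parse_blog_article(fetch(url)) for url in urls]
```

With the `links` list built earlier, a call would look like `articles = get_blog_articles(links)`. Keeping the parsing separate from the download makes it possible to check the extraction logic against a saved page without hitting the site again.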

## News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

* Business
* Sports
* Technology
* Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

```python
{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}
```

In [30]:
url = 'https://inshorts.com/en/read'
response = get(url)
soup = BeautifulSoup(response.content, 'html.parser')

### Get title

In [31]:
soup.find_all('span', itemprop='headline')

[<span itemprop="headline">Petition filed in HC challenging 'The Kerala Story' ban in Bengal</span>,
 <span itemprop="headline">Kriti Sanon wears saree with 24-carat gold print, Abu Jani &amp; Sandeep Khosla share pics</span>,
 <span itemprop="headline">Haryana CM makes 'The Kerala Story' tax-free in the state </span>,
 <span itemprop="headline">It seems he bats on a computer: Sourav Ganguly on Suryakumar Yadav's 83(35) vs RCB</span>,
 <span itemprop="headline">CSK defend their lowest total in IPL 2023, win 7th match in a row vs DC in Chennai</span>,
 <span itemprop="headline">MS Dhoni mock-hits Deepak Chahar, frightens him after toss; video goes viral</span>,
 <span itemprop="headline">BCCI projected to earn ₹1,885 crore per year in ICC's new finance model: Report</span>,
 <span itemprop="headline">Sergio Busquets leaving Barcelona after 18 years amid talks over Saudi move</span>,
 <span itemprop="headline">Which Indian unicorns saw valuation markdowns in recent times?</span>,
 <span 

In [32]:
soup.find_all('span', itemprop='headline')[0].text

"Petition filed in HC challenging 'The Kerala Story' ban in Bengal"

### Get content

In [33]:
soup.find_all('div', itemprop='articleBody')

[<div itemprop="articleBody">After West Bengal CM Mamata Banerjee banned the movie 'The Kerala Story' on Monday, a public interest litigation (PIL) has been filed against the same in Calcutta High Court. The petitioner on Wednesday told the court that the state government's decision is against the right to freedom of speech. The case will be heard by the court on May 15.</div>,
 <div itemprop="articleBody">Kriti Sanon wore a saree by Abu Jani and Sandeep Khosla for an event. Sharing pictures, the designers wrote, "Kriti Sanon is a vision in a double-drape saree featuring a mix off-white khadi saree with zardozi border and a vintage Kerala cotton saree with 24-carat gold Khadi block print." "A Diva...so good in Indian wear," an Instagram user commented.</div>,
 <div itemprop="articleBody">Haryana CM Manohar Lal Khattar on Wednesday said that 'The Kerala Story' has been made tax-free in the state. This comes after the movie was made tax-free in other BJP ruled states, including UP, MP an

In [34]:
soup.find_all('div', itemprop='articleBody')[0].text

"After West Bengal CM Mamata Banerjee banned the movie 'The Kerala Story' on Monday, a public interest litigation (PIL) has been filed against the same in Calcutta High Court. The petitioner on Wednesday told the court that the state government's decision is against the right to freedom of speech. The case will be heard by the court on May 15."

### Get categories

In [35]:
soup.select('li')

[<li class="active-category selected">All News</li>,
 <li class="active-category">India</li>,
 <li class="active-category">Business</li>,
 <li class="active-category">Sports</li>,
 <li class="active-category">World</li>,
 <li class="active-category">Politics</li>,
 <li class="active-category">Technology</li>,
 <li class="active-category">Startup</li>,
 <li class="active-category">Entertainment</li>,
 <li class="active-category">Miscellaneous</li>,
 <li class="active-category">Hatke</li>,
 <li class="active-category">Science</li>,
 <li class="active-category">Automobile</li>]

In [36]:
[li.text.lower() for li in soup.select('li')][1:]

['india',
 'business',
 'sports',
 'world',
 'politics',
 'technology',
 'startup',
 'entertainment',
 'miscellaneous',
 'hatke',
 'science',
 'automobile']

In [37]:
categories = [li.text.lower() for li in soup.select('li')][1:]
categories[0] = 'national'
categories

['national',
 'business',
 'sports',
 'world',
 'politics',
 'technology',
 'startup',
 'entertainment',
 'miscellaneous',
 'hatke',
 'science',
 'automobile']
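
Overwriting `categories[0]` works only as long as 'india' stays first in the list. A slightly more defensive sketch maps labels to URL slugs by name. The `slug_overrides` dictionary is a name introduced here, and the only override assumed is the 'india' to 'national' substitution used above.

```python
# Category labels as scraped from the page (lowercased), copied from the output above.
labels = ['india', 'business', 'sports', 'world', 'politics', 'technology',
          'startup', 'entertainment', 'miscellaneous', 'hatke', 'science', 'automobile']

# Labels whose URL slug differs from the displayed text.
slug_overrides = {'india': 'national'}

# Fall back to the label itself when no override is listed.
categories = [slug_overrides.get(label, label) for label in labels]

base_url = 'https://inshorts.com/en/read'
category_urls = [f'{base_url}/{category}' for category in categories]
print(category_urls[:2])
```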

### Put it together

In [38]:
# categories = ['business', 'sports', 'technology', 'entertainment']
categories = [li.text.lower() for li in soup.select('li')][1:]
categories[0] = 'national'

inshorts = []

for category in categories:
    
    url = 'https://inshorts.com/en/read' + '/' + category
    response = get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    titles = [span.text for span in soup.find_all('span', itemprop='headline')]
    contents = [div.text for div in soup.find_all('div', itemprop='articleBody')]
    
    # Pair each headline with its body; zip stops at the shorter list.
    for title, content in zip(titles, contents):

        article = {
            'title': title,
            'content': content,
            'category': category,
        }

        inshorts.append(article)

In [39]:
inshorts[0:5]

[{'title': "Petition filed in HC challenging 'The Kerala Story' ban in Bengal",
  'content': "After West Bengal CM Mamata Banerjee banned the movie 'The Kerala Story' on Monday, a public interest litigation (PIL) has been filed against the same in Calcutta High Court. The petitioner on Wednesday told the court that the state government's decision is against the right to freedom of speech. The case will be heard by the court on May 15.",
  'category': 'national'},
 {'title': "Haryana CM makes 'The Kerala Story' tax-free in the state ",
  'content': "Haryana CM Manohar Lal Khattar on Wednesday said that 'The Kerala Story' has been made tax-free in the state. This comes after the movie was made tax-free in other BJP ruled states, including UP, MP and Uttarakhand. However, the movie was banned in West Bengal. It was done to avoid incidents of hatred in the state, CM Mamata Banerjee said. ",
  'category': 'national'},
 {'title': "Exit polls are done in hurry, there will be many errors: K'ta

In [40]:
inshorts_article_df = pd.DataFrame(inshorts)
inshorts_article_df

|   | title | content | category |
|---|---|---|---|
| 0 | Petition filed in HC challenging 'The Kerala S... | After West Bengal CM Mamata Banerjee banned th... | national |
| 1 | Haryana CM makes 'The Kerala Story' tax-free i... | Haryana CM Manohar Lal Khattar on Wednesday sa... | national |
| 2 | Exit polls are done in hurry, there will be ma... | As multiple exit polls predicted Congress' lea... | national |
| 3 | 7 members of family burnt to death as fire bre... | Seven members of a family, including four mino... | national |
| 4 | Rahul Gandhi issued notice over his 'unannounc... | Days after he visited Delhi University's PG Me... | national |
| ... | ... | ... | ... |
| 291 | Volvo Cars to slash 1,300 white-collar jobs to... | The Chinese-owned Volvo Cars of Sweden said in... | automobile |
| 292 | EV ecosystem still not mature in India: Renaul... | Venkatram Mamillapalle, Renault India MD and C... | automobile |
| 293 | Lordstown shares fall 25% as Foxconn alleges $... | Lordstown Motors' shares fell by 25% after maj... | automobile |
| 294 | Honda Cars records 33% Y-o-Y dip in domestic s... | Honda Cars India recorded a 33% year-on-year d... | automobile |


In [41]:
inshorts_article_df.to_csv('news_articles.csv', index=False)
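
The exercise asks for a function named `get_news_articles` covering four topics, whereas the loop above scrapes every category. A possible sketch of that function follows, under the assumption that the page structure matches what the cells above found. The helpers `parse_news_page` and `fetch_html` and the `fetch` parameter are hypothetical names introduced here.

```python
from bs4 import BeautifulSoup
from requests import get

BASE_URL = 'https://inshorts.com/en/read'


def parse_news_page(html, category):
    """Pair each headline with its article body on one category page."""
    soup = BeautifulSoup(html, 'html.parser')
    titles = [span.text for span in soup.find_all('span', itemprop='headline')]
    contents = [div.text for div in soup.find_all('div', itemprop='articleBody')]
    # zip stops at the shorter list, so a missing body cannot raise an IndexError.
    return [
        {'title': title, 'content': content, 'category': category}
        for title, content in zip(titles, contents)
    ]


def fetch_html(url):
    """Download a page and return its HTML text (requires network access)."""
    return get(url).text


def get_news_articles(categories=('business', 'sports', 'technology', 'entertainment'),
                      fetch=fetch_html):
    """Return one dictionary per article across the requested categories."""
    articles = []
    for category in categories:
        articles.extend(parse_news_page(fetch(f'{BASE_URL}/{category}'), category))
    return articles
```

Calling `get_news_articles()` with no arguments would scrape the four topics named in the exercise; passing a different tuple of slugs, such as the full `categories` list above, widens the scrape.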