# Data Acquisition 

## Imports

In [1]:
from requests import get
from bs4 import BeautifulSoup
import os
import pandas as pd

## Codeup Blog Articles

Visit Codeup's blog and record the URLs for at least 5 distinct blog posts. For each post, scrape at least the post's title and content.

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

```python
{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
```

Plus any additional properties you think might be helpful.



In [2]:
url = 'https://codeup.com/blog/'
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)

In [3]:
soup = BeautifulSoup(response.content, 'html.parser')

### class='more-link' is one way to get access to each article link

In [4]:
soup.select('.more-link')  # same result here: soup.find_all('a', class_='more-link')

[<a class="more-link" href="https://codeup.com/data-science/become-a-data-scientist/">read more</a>,
 <a class="more-link" href="https://codeup.com/employers/hiring-tech-talent/">read more</a>,
 <a class="more-link" href="https://codeup.com/cloud-administration/cap-funding-options/">read more</a>,
 <a class="more-link" href="https://codeup.com/dallas-info/it-professionals-dallas/">read more</a>,
 <a class="more-link" href="https://codeup.com/codeup-news/codeup-voted-1-technical-school-in-dfw/">read more</a>,
 <a class="more-link" href="https://codeup.com/tips-for-prospective-students/financing/codeups-scholarships/">read more</a>]

In [5]:
soup.select('.more-link')[0]

<a class="more-link" href="https://codeup.com/data-science/become-a-data-scientist/">read more</a>

In [6]:
soup.select('.more-link')[0]['href']

'https://codeup.com/data-science/become-a-data-scientist/'
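
The two lookups mentioned above, `select` with a CSS selector and `find_all` with a class filter, can be compared on a small offline sample. This is a minimal sketch: `sample_html` and its example.com URLs are made up for illustration and are not taken from the Codeup site.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the blog index page, so the sketch runs offline.
sample_html = """
<div>
  <a class="more-link" href="https://example.com/post-1/">read more</a>
  <a class="other" href="https://example.com/about/">about</a>
  <a class="more-link" href="https://example.com/post-2/">read more</a>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# CSS selector: any tag carrying the class "more-link".
via_select = [a['href'] for a in sample_soup.select('.more-link')]

# Tag name plus class filter: only <a> tags carrying that class.
via_find_all = [a['href'] for a in sample_soup.find_all('a', class_='more-link')]

print(via_select == via_find_all)
```

The two approaches agree whenever every element with the class is an `<a>` tag, which is the case in the output shown above.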

### List comprehension review

In [7]:
[n for n in range(1, 11)]

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

### Using list comprehension to get all the links out

In [8]:
links = [link['href'] for link in soup.select('.more-link')]
links

['https://codeup.com/data-science/become-a-data-scientist/',
 'https://codeup.com/employers/hiring-tech-talent/',
 'https://codeup.com/cloud-administration/cap-funding-options/',
 'https://codeup.com/dallas-info/it-professionals-dallas/',
 'https://codeup.com/codeup-news/codeup-voted-1-technical-school-in-dfw/',
 'https://codeup.com/tips-for-prospective-students/financing/codeups-scholarships/']
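
The output above shows no repeated links, but if an index page ever listed the same post twice, each duplicate would be scraped twice. A minimal sketch of removing duplicates while keeping the original order follows; `raw_links` is a made-up list for illustration.

```python
# Hypothetical list in which one URL appears twice.
raw_links = [
    'https://example.com/post-1/',
    'https://example.com/post-2/',
    'https://example.com/post-1/',
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats without reordering the links.
unique_links = list(dict.fromkeys(raw_links))

print(unique_links)
```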

### Get title and content from article

In [9]:
url = links[0]
response = get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

In [10]:
soup.find('h1', class_='entry-title').text

'Become a Data Scientist in 6 Months!'

In [13]:
soup.find('div', class_='entry-content').text.strip()

'Are you feeling unfulfilled in your work but want to avoid returning to the traditional educational route? Codeup can help! Starting over as a professional is daunting and not always ideal. Codeup can help you go from a career you are bored with, to a job that excites you in just 6 months!\nHere’s how…\nData Science Program\nDuring our 20-week program, you will have the opportunity to take your career to new heights with data science being one of the most needed jobs in tech.\nYou’ll gather data, then clean it, explore it for trends, and apply machine learning models to make predictions.\nUpon completing this program, you will know how to turn insights into actionable recommendations. You’ll be a huge asset to any company, having all the technical skills to become a data scientist with projects upon projects of experience under your belt.\nCodeup\nA common reason individuals opt not to change their careers is fear it is too late. Codeup has crafted a program that will guide you throug

### Put it together

In [14]:
url = 'https://codeup.com/blog/'
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)

soup = BeautifulSoup(response.content, 'html.parser')

links = [link['href'] for link in soup.select('.more-link')]

articles = []

for url in links:
    
    url_response = get(url, headers=headers)
    soup = BeautifulSoup(url_response.text, 'html.parser')
    
    title = soup.find('h1', class_='entry-title').text
    content = soup.find('div', class_='entry-content').text.strip()
    
    article_dict = {
        'title': title,
        'content': content
    }
    
    articles.append(article_dict)
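
The exercise asks for this work to be wrapped in a function named get_blog_articles. One possible sketch is below. The helper `parse_blog_article`, the `HEADERS` constant, the `blog_url` parameter, the 10-second timeout, and the extra `'url'` key are all assumptions added for illustration; the selectors are the ones used in the cells above. Unlike the cells above, it also strips whitespace from the title and returns `None` when a tag is missing.

```python
from requests import get
from bs4 import BeautifulSoup

# Same User-Agent as the cells above; the constant name is an assumption.
HEADERS = {'User-Agent': 'Codeup Data Science'}


def parse_blog_article(html):
    """Pull the title and body text out of one article page's HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    title_tag = soup.find('h1', class_='entry-title')
    content_tag = soup.find('div', class_='entry-content')
    # find() returns None when a tag is missing, so guard before reading .text.
    return {
        'title': title_tag.text.strip() if title_tag else None,
        'content': content_tag.text.strip() if content_tag else None,
    }


def get_blog_articles(blog_url='https://codeup.com/blog/'):
    """Return one dictionary per article linked from the blog index."""
    index = get(blog_url, headers=HEADERS, timeout=10)
    index.raise_for_status()  # stop early on a 4xx/5xx response
    index_soup = BeautifulSoup(index.content, 'html.parser')
    links = [a['href'] for a in index_soup.select('.more-link')]

    articles = []
    for link in links:
        page = get(link, headers=HEADERS, timeout=10)
        page.raise_for_status()
        article = parse_blog_article(page.text)
        article['url'] = link  # extra property (an assumption, not required)
        articles.append(article)
    return articles
```

Keeping the parsing step in its own function means it can be checked against a saved or hand-written HTML string without touching the network.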

In [15]:
articles[0:5]

[{'title': 'Become a Data Scientist in 6 Months!',
  'content': 'Are you feeling unfulfilled in your work but want to avoid returning to the traditional educational route? Codeup can help! Starting over as a professional is daunting and not always ideal. Codeup can help you go from a career you are bored with, to a job that excites you in just 6 months!\nHere’s how…\nData Science Program\nDuring our 20-week program, you will have the opportunity to take your career to new heights with data science being one of the most needed jobs in tech.\nYou’ll gather data, then clean it, explore it for trends, and apply machine learning models to make predictions.\nUpon completing this program, you will know how to turn insights into actionable recommendations. You’ll be a huge asset to any company, having all the technical skills to become a data scientist with projects upon projects of experience under your belt.\nCodeup\nA common reason individuals opt not to change their careers is fear it is t

### Put it in a DataFrame

In [16]:
blog_article_df = pd.DataFrame(articles)
blog_article_df

| | title | content |
|---|---|---|
| 0 | Become a Data Scientist in 6 Months! | Are you feeling unfulfilled in your work but w... |
| 1 | Hiring Tech Talent Around the Holidays | Are you a hiring manager having trouble fillin... |
| 2 | Cloud Administration Program New Funding Options | Finding resources to fund your educational goa... |
| 3 | Why Dallas is a Great Location for IT Professi... | When breaking into a new career, it is importa... |
| 4 | Codeup is ranked #1 Best in DFW 2022 | We are excited to announce that Codeup ranked ... |
| 5 | Codeup’s Scholarship Offerings | In honor of November being National Scholarshi... |


In [17]:
blog_article_df.to_csv('blog_articles.csv', index=False)
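
Re-scraping on every run is slow and puts repeated load on the site, and the `os` import is not used in the cells shown here. One way it could be used is a file cache, sketched below under stated assumptions: `load_or_scrape` is a hypothetical helper, and it stores JSON rather than the CSV written above, because JSON round-trips a list of dictionaries with the standard library alone.

```python
import json
import os


def load_or_scrape(path, scrape):
    """Return cached articles from `path`, calling `scrape()` only if needed.

    `scrape` is any zero-argument function that returns a list of
    dictionaries, for example a scraper like the loop above.
    """
    if os.path.exists(path):
        # Cache hit: read the saved articles instead of hitting the site.
        with open(path, encoding='utf-8') as f:
            return json.load(f)

    # Cache miss: scrape once, then save the result for next time.
    articles = scrape()
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(articles, f, ensure_ascii=False, indent=2)
    return articles
```

Deleting the cache file forces a fresh scrape on the next call.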

## News Articles

Next, we will scrape text data from inshorts, a website that publishes brief news summaries across many different topics.

Write a function that scrapes the news articles for the following topics:

* Business
* Sports
* Technology
* Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

```python
{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}
```

In [18]:
url = 'https://inshorts.com/en/read'
response = get(url)
soup = BeautifulSoup(response.content, 'html.parser')

### Get title

In [21]:
soup.find_all('span', itemprop='headline')[0].text

"Drunk man smoked in toilet, another peed on woman's blanket on Air India flight"

### Get content

In [22]:
soup.find_all('div', itemprop='articleBody')[0].text

"A drunk passenger smoked in the toilet on a Paris-Delhi Air India flight on December 6, the DGCA said. This is the same Paris-Delhi flight on which another drunk man urinated on a woman co-passenger's blanket when she went to the lavatory. Separately, Shankar Mishra was arrested for urinating on a woman on Air India's November 26 New York-Delhi flight."

### Get categories

In [23]:
categories = [li.text.lower() for li in soup.select('li')][1:]
categories[0] = 'national'
categories

['national',
 'business',
 'sports',
 'world',
 'politics',
 'technology',
 'startup',
 'entertainment',
 'miscellaneous',
 'hatke',
 'science',
 'automobile']
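
The exercise asks for four topics, while the list above holds every category found on the page. A minimal sketch of narrowing it down follows; the names `all_categories`, `requested`, and `categories_to_scrape` are assumptions for illustration, and the list is copied from the output above.

```python
# Category slugs copied from the output above.
all_categories = ['national', 'business', 'sports', 'world', 'politics',
                  'technology', 'startup', 'entertainment', 'miscellaneous',
                  'hatke', 'science', 'automobile']

# The four topics named in the exercise.
requested = {'business', 'sports', 'technology', 'entertainment'}

# Keep the page's own ordering while dropping everything else.
categories_to_scrape = [c for c in all_categories if c in requested]

print(categories_to_scrape)
```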

### Put it together

In [24]:
# To scrape only the four requested topics, use this list instead:
# categories = ['business', 'sports', 'technology', 'entertainment']
categories = [li.text.lower() for li in soup.select('li')][1:]
categories[0] = 'national'

inshorts = []

for category in categories:
    
    url = 'https://inshorts.com/en/read' + '/' + category
    response = get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    titles = [span.text for span in soup.find_all('span', itemprop='headline')]
    contents = [div.text for div in soup.find_all('div', itemprop='articleBody')]
    
    # Pair each headline with its body; zip stops at the shorter list
    for title, content in zip(titles, contents):

        article = {
            'title': title,
            'content': content,
            'category': category,
        }

        inshorts.append(article)
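
The exercise asks for a function named get_news_articles. One possible sketch is below. The helper `parse_news_page`, the `categories` parameter with its four-topic default, and the 10-second timeout are assumptions added for illustration; the selectors and the URL pattern are the ones used in the cell above.

```python
from requests import get
from bs4 import BeautifulSoup


def parse_news_page(html, category):
    """Turn one category page's HTML into a list of article dictionaries."""
    soup = BeautifulSoup(html, 'html.parser')
    titles = [span.text for span in soup.find_all('span', itemprop='headline')]
    contents = [div.text for div in soup.find_all('div', itemprop='articleBody')]
    # zip pairs each headline with its body and stops at the shorter list.
    return [
        {'title': title, 'content': content, 'category': category}
        for title, content in zip(titles, contents)
    ]


def get_news_articles(categories=('business', 'sports', 'technology', 'entertainment')):
    """Scrape every requested category and return all articles in one list."""
    base_url = 'https://inshorts.com/en/read'
    articles = []
    for category in categories:
        response = get(f'{base_url}/{category}', timeout=10)
        response.raise_for_status()  # stop early on a 4xx/5xx response
        articles.extend(parse_news_page(response.content, category))
    return articles
```

Passing a different tuple or list of category slugs scrapes those pages instead of the default four.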

In [25]:
inshorts[0:5]

[{'title': "Drunk man smoked in toilet, another peed on woman's blanket on Air India flight",
  'content': "A drunk passenger smoked in the toilet on a Paris-Delhi Air India flight on December 6, the DGCA said. This is the same Paris-Delhi flight on which another drunk man urinated on a woman co-passenger's blanket when she went to the lavatory. Separately, Shankar Mishra was arrested for urinating on a woman on Air India's November 26 New York-Delhi flight.",
  'category': 'national'},
 {'title': 'Coaching centre run by Rajasthan paper leak case accused demolished in Jaipur',
  'content': 'The Jaipur Development Authority (JDA) on Monday demolished a five-storey building of a coaching centre run by Suresh Dhaka, whose name appeared in the second-grade teacher recruitment examination paper leak case. The JDA found that the Adhigam Coaching Centre building was built in violation of laws. The coaching institute was served the notice twice, an official said.',
  'category': 'national'},
 

In [26]:
inshorts_article_df = pd.DataFrame(inshorts)
inshorts_article_df

| | title | content | category |
|---|---|---|---|
| 0 | Drunk man smoked in toilet, another peed on wo... | A drunk passenger smoked in the toilet on a Pa... | national |
| 1 | Coaching centre run by Rajasthan paper leak ca... | The Jaipur Development Authority (JDA) on Mond... | national |
| 2 | Temporary ban on BS-III petrol & BS-IV diesel ... | The Delhi government has decided to impose a t... | national |
| 3 | Joshimath divided into 3 zones, govt says most... | Uttarakhand's Joshimath, where a majority of b... | national |
| 4 | 4 dead, 10 hurt as bus hits truck amid dense f... | At least four people were killed after a speed... | national |
| ... | ... | ... | ... |
| 290 | Electric two-wheeler sales cross 6-lakh mark i... | Electric two-wheeler (E2W) sales in 2022 hit a... | automobile |
| 291 | Tesla shares fall further after firm misses 20... | Tesla's shares, which dipped roughly 65% last ... | automobile |
| 292 | GM beats Toyota to reclaim top US automaker sp... | General Motors (GM) reclaimed the top US autom... | automobile |
| 293 | Rolls-Royce reports record car sales in 118 years | Luxury car maker Rolls-Royce has sold 6021 car... | automobile |


In [27]:
inshorts_article_df.to_csv('news_articles.csv', index=False)