# Data Acquisition

### Why is Machine Learning Difficult?

- The data required to train models is often **not available**.
- In other cases, the data is in **a very raw format that requires a lot of cleaning and feature engineering**.

### What are some common ways to collect data?

1. Collection from a publicly available data repository (files). ✅
2. Web APIs (JSON)
3. Scraping a website (check for legality).
4. Databases (SQL / NoSQL)

# Problem Statement: How to get all tweets with a tag `omicron`?
Get all the tweets from Twitter that talk about `#omicron`.

`Transition question`: **Does Twitter offer some sort of service that allows us to query data in this manner?**

## WWW


The World Wide Web is about communication between web clients and web servers.

Clients are often browsers (Chrome, Edge, Safari), but they can be any type of program or device.

Servers are most often computers in the cloud.



![Screenshot%202022-06-15%20at%206.56.51%20PM.png](attachment:Screenshot%202022-06-15%20at%206.56.51%20PM.png)



## HTTP Request / Response
Communication between clients and servers is done by requests and responses:

1. A client (a browser) sends an HTTP request to the web.
2. A web server receives the request.
3. The server runs an application to process the request.
4. The server returns an HTTP response (output) to the browser.
5. The client (the browser) receives the response.
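
The five steps above can be reproduced on a single machine. The following is a minimal sketch using only Python's standard library: a tiny server (the hypothetical `HelloHandler`) runs in a background thread on a free local port, and a client sends it one GET request. The handler name and the response text are illustrative assumptions, not part of any real service.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a short plain-text body."""

    def do_GET(self):
        body = b"hello from the server"
        self.send_response(200)                      # status line
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)                       # response body

    def log_message(self, format, *args):
        pass  # keep the demo quiet: no per-request log lines


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client side: send a request, then read the response.
conn = HTTPConnection("127.0.0.1", port)
conn.request("GET", "/")
response = conn.getresponse()
status = response.status
text = response.read().decode()
conn.close()

server.shutdown()
server.server_close()
print(status, text)
```
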


## The HTTP Request Circle

A typical HTTP request / response circle:

- The browser requests an HTML page. The server returns an HTML file.
- The browser requests a style sheet. The server returns a CSS file.
- The browser requests a JPG image. The server returns a JPG file.
- The browser requests JavaScript code. The server returns a JS file.
- The browser requests data. The server returns data (in XML or JSON).

## The GET Method
GET is used to request data from a specified resource.

Note that the query string (name/value pairs) is sent in the URL of a GET request:

`/test/demo_form.php?name1=value1&name2=value2`

**Some notes on GET requests:**

- GET requests can be cached
- GET requests remain in the browser history
- GET requests can be bookmarked
- GET requests should never be used when dealing with sensitive data
- GET requests have length restrictions
- GET requests are only used to request data (not modify)
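
As a small illustration of the query string described above, the sketch below builds the same name/value pairs with the standard library and then parses them back out of the URL. Nothing is sent over the network; the path is the example path from this section.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Name/value pairs that should travel in the URL of a GET request.
params = {"name1": "value1", "name2": "value2"}

# urlencode joins the pairs with "&" and escapes unsafe characters.
url = "/test/demo_form.php?" + urlencode(params)
print(url)

# On the receiving side, the query string can be split back into pairs.
parsed = parse_qs(urlsplit(url).query)
print(parsed)
```

Because the pairs are part of the URL, they end up in caches, bookmarks, and browser history, which is why GET is a poor fit for sensitive data.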


## The POST Method
POST is used to send data to a server to create/update a resource.

The data sent to the server with POST is stored in the request body of the HTTP request:

```http
POST /test/demo_form.php HTTP/1.1
Host: w3schools.com

name1=value1&name2=value2
```

**Some notes on POST requests:**

- POST requests are never cached
- POST requests do not remain in the browser history
- POST requests cannot be bookmarked
- POST requests have no restrictions on data length
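
To make the difference from GET concrete, here is a minimal sketch that prepares (but does not send) a POST request with the standard library. The URL reuses the host and path from the example above; the point is only that the name/value pairs sit in the request body rather than in the URL.

```python
from urllib.parse import urlencode
from urllib.request import Request

form = {"name1": "value1", "name2": "value2"}

# The form fields are encoded into bytes and carried in the request body.
body = urlencode(form).encode("ascii")

request = Request(
    "http://w3schools.com/test/demo_form.php",
    data=body,
    method="POST",
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

print(request.get_method())  # the HTTP method that would be used
print(request.full_url)      # note: no query string in the URL
print(request.data)          # the body that would be sent
```
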

# What is an API?
An Application Programming Interface (API) is software that allows two applications to talk to each other and exchange data. Each time you check the weather on your phone or use a Google service, you're using an API.

**APIs are just like function calls, but those functions sit on a web server, and the API is the way to invoke them from your program**.
1. We can send a request to the web server (to its API) to get the data.
2. In return, if the call succeeds, the API returns the data, mostly in **`json` format**.

Some websites and APIs are open to all and provide free data, whereas most APIs are paid and require some form of authentication with API keys.

### Use cases of APIs:
- APIs can be used to call a function on a web server to perform a task and return the required response.
- APIs can also be used to get or send data over the internet.


Let's first look at a few free web APIs, and then we will explore a paid web API.



## How to make these API calls?

A number of things you should know when making a request on the web:
1. Protocol - HTTP
2. Authentication credentials for the API being called.
3. Functional requirements of the API (URL for the endpoint) - parameters, syntax, and sample response structure (is it JSON or something else) - using the language of your choice.
4. [Optional]: Are there 3rd party solutions available to make these requests easy? - e.g.: `yahoofinance`, `tweepy`
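
The checklist above can be sketched as a single prepared request. This is an illustrative example only: the endpoint `https://api.example.com/v1/search`, its parameter names, and the token value are hypothetical placeholders, and the request is built but never sent. Bearer tokens in an `Authorization` header are one common authentication scheme; a given API may use a different one.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameters, for illustration only.
endpoint = "https://api.example.com/v1/search"
params = {"query": "#omicron", "max_results": 10}
token = "YOUR_TOKEN"  # placeholder, not a real credential

request = Request(
    f"{endpoint}?{urlencode(params)}",                # endpoint + parameters
    headers={"Authorization": f"Bearer {token}"},     # credentials
)

print(request.get_method())                 # protocol method (GET by default)
print(request.full_url)                     # "#" is percent-encoded as %23
print(request.get_header("Authorization"))
```
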



#### 1. How to send an HTTP request using Python?

**requests** package

In [None]:
# !pip install requests

In [1]:
import requests
import json

In [2]:
url = "https://api.ipify.org"


In [3]:
response = requests.get(url)
response

<Response [200]>

In [4]:
response.status_code

200

In [5]:
response.text

'122.161.80.49'
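
The status code is the first thing to check on any response. As a rough sketch, the hypothetical helper `describe_status` below uses the standard library's `HTTPStatus` to turn a code into a readable label and group it by its hundreds range.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a label such as '200 OK (success)' for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirection"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational"
    return f"{code} {status.phrase} ({kind})"


print(describe_status(200))
print(describe_status(400))
print(describe_status(404))
```
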

## Getting `omicron` tagged tweets

In [6]:
!pip install tweepy



In [7]:
import tweepy

In [9]:
## add your bearer token (keep real tokens out of shared notebooks)

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")
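
A token pasted into a notebook is easy to leak when the notebook is shared. One common alternative, sketched below, is to read it from an environment variable. The helper `load_token` and the variable name `TWITTER_BEARER_TOKEN` are assumptions chosen for this example, not names required by Tweepy.

```python
import os


def load_token(var_name="TWITTER_BEARER_TOKEN"):
    """Read a credential from the environment, failing loudly if it is missing."""
    token = os.environ.get(var_name)
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return token


# Usage (assumes the variable was exported before starting the notebook):
# client = tweepy.Client(bearer_token=load_token())
```
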




## Extract tweets from an account

In [11]:
query = 'from:scaler_official -is:retweet'


tweets = client.search_recent_tweets(query=query, tweet_fields = ['created_at', 'author_id'], 
                                     max_results=100)

print(len(tweets.data))




11


In [15]:
for tweet in tweets.data:
    print(tweet.text)
    print(tweet.author_id)
    print('---------')

We read the queries you sent on our DM and came up with something super helpful. A #career planning tool by #SCALER, that will help answer all the queries you have on building a career in tech and scoring a job at your dream company.
https://t.co/lw1EBpLdcw

#CreateImpact https://t.co/uwu2FSyeN4
1194875574331699200
---------
Modern technology would be unthinkable without mathematics. The relationship is reciprocal since mathematics also needs #technology.

Math 2.0 Day 

#MathDay https://t.co/5rkbx7k3A7
1194875574331699200
---------
#CodersOffline is a conclave that brings together tech enthusiasts and bridges the gap between a learner and an expert. 

Here is a glimpse of our interactions in Hyderabad, Pune and Mumbai.

Coming next to Chennai, Kolkata and Delhi.
Stay tuned for more.

#CreateImpact https://t.co/qXp7rc35Zb
1194875574331699200
---------
Building data pipelines that will process your data and move it from one system to another is already a challenge. 
To understand other 

In [16]:
## get the Twitter user ID from the username
client.get_user(username="elonmusk")

Response(data=<User id=44196397 name=Elon Musk username=elonmusk>, includes={}, errors=[], meta={})

In [21]:
## get the list of accounts Elon Musk follows

fol_ing_list = client.get_users_following(id="44196397")
len(fol_ing_list.data)



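Note: passing the username instead of the numeric ID (`id="elonmusk"`) fails with the error below; use the ID returned by `get_user`.

BadRequest: 400 Bad Request
The `id` query parameter value [elonmusk] is not valid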

In [20]:
fol_ing_list.data

[<User id=72862939 name=HIRO MIZUNO username=hiromichimizuno>,
 <User id=34719119 name=Walter Isaacson username=WalterIsaacson>,
 <User id=1605 name=Sam Altman username=sama>,
 <User id=18989355 name=Mike Solana username=micsolana>,
 <User id=300878435 name=Dilbert username=Dilbert_Daily>,
 <User id=541882699 name=Andrea Stroppa username=Andst7>,
 <User id=38271276 name=Matt Taibbi username=mtaibbi>,
 <User id=11417802 name=Joe Gebbia username=jgebbia>,
 <User id=895332160130891776 name=Neuralink username=neuralink>,
 <User id=276540738 name=𝔊𝔯𝔦𝔪𝔢𝔰 (⌛️,⏳) ᚷᚱᛁᛗᛖᛋ username=Grimezsz>,
 <User id=1231406720 name=Michael Sheetz username=thesheetztweetz>,
 <User id=959471389282578432 name=Eva Fox 🦊🇺🇦 Shadow Crew username=EvaFoxU>,
 <User id=17663776 name=Planet username=planet>,
 <User id=2292565884 name=PCMR username=OfficialPCMR>,
 <User id=2993230373 name=Universal-Sci username=universal_sci>,
 <User id=1179892477714718721 name=Science girl username=gunsnrosesgirl3>,
 <User id=76980293 nam

In [22]:
query = "#omicron -is:retweet"

data = {
    'text': [],
    'created_at': [],
    'author_id': []
}


for tweet in tweepy.Paginator(client.search_recent_tweets, query=query,
                             tweet_fields=['created_at', 'author_id'],
                             max_results=100).flatten(1000):
    data['text'].append(tweet.text)
    data['created_at'].append(tweet.created_at)
    data['author_id'].append(tweet.author_id)
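
The search endpoint returns results one page at a time, and a paginator keeps asking for the next page until a limit is reached. The sketch below illustrates that idea with a fake data source. It is not Tweepy's implementation: `fake_search`, `iterate_all`, and the dictionary layout of a page are hypothetical stand-ins invented for this example.

```python
def fake_search(query, max_results=100, next_token=None):
    """Hypothetical stand-in for a paged search endpoint (no network involved)."""
    total = 250                      # pretend 250 items match the query
    start = int(next_token or 0)
    end = min(start + max_results, total)
    items = [f"{query} result {i}" for i in range(start, end)]
    token = str(end) if end < total else None   # None means "no more pages"
    return {"data": items, "meta": {"next_token": token}}


def iterate_all(fetch_page, limit, **kwargs):
    """Yield up to `limit` items, following next_token from page to page."""
    yielded = 0
    token = None
    while yielded < limit:
        page = fetch_page(next_token=token, **kwargs)
        for item in page["data"]:
            if yielded >= limit:
                return
            yield item
            yielded += 1
        token = page["meta"]["next_token"]
        if token is None:            # the server has nothing further
            return


results = list(iterate_all(fake_search, limit=1000, query="#omicron", max_results=100))
print(len(results))
```

Here the loop stops early because the fake source runs out of items before the limit of 1000 is reached.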



In [23]:
data

{'text': ['hospitalizations from #COVID19 are on the rise again in B.C. #covid19BC #omicron 😷\nhttps://t.co/7B2XN0naV9 https://t.co/9RKuM49IK1',
  'BA.5 is here to party. Assess your risk accordingly. #covid19 #omicron #pandemic\n\nhttps://t.co/fdQTgGrtUA',
  'Get Ready for the Forever Plague  via @TheTyee https://t.co/u5eUHqfjdH  #TheForeverPlague #TheTyee #TheHardTruth #Covid_19 #Omicron #subvariants #BA5 #SARSCoV2 #CovidVaccines #CovidVaccinations #health #pandemic #ImmuneSystem',
  'Newer iterations of the #Omicron variant, BA.4 and BA.5 are spreading rapidly and poised to outcompete past versions of the virus, extending the current COVID-19 surge in #Bangladesh. \n\nSource @icddr_b : https://t.co/ir7fyDeMoH https://t.co/7UrSALxK7f',
  'Esto ya parece el Apocalipsis, pero es apenas un aviso.\nAlerta OMS por la nueva variante Covid: Centaurus que es cinco veces más contagiosa que #Omicron \nY luego en #Monterrey #Mx con escasez de agua, los protocolos sanitarios se complican.\nhttps

In [24]:
import pandas as pd

df = pd.DataFrame(data)
df.head()

| | text | created_at | author_id |
|---|---|---|---|
| 0 | hospitalizations from #COVID19 are on the rise... | 2022-07-08 16:55:48+00:00 | 15733528 |
| 1 | BA.5 is here to party. Assess your risk accord... | 2022-07-08 16:53:25+00:00 | 14355077 |
| 2 | Get Ready for the Forever Plague via @TheTyee... | 2022-07-08 16:49:34+00:00 | 846779868 |
| 3 | Newer iterations of the #Omicron variant, BA.4... | 2022-07-08 16:49:00+00:00 | 24401670 |
| 4 | Esto ya parece el Apocalipsis, pero es apenas ... | 2022-07-08 16:47:45+00:00 | 160323306 |


In [26]:
df.to_csv("tweets.csv", index=False)

## OpenWeathermap API

In [29]:
# replace YOUR_API_KEY with your own OpenWeatherMap key
url = "https://api.openweathermap.org/data/2.5/weather?q=delhi&appid=YOUR_API_KEY"


In [30]:
r = requests.get(url)
r.text

'{"coord":{"lon":77.2167,"lat":28.6667},"weather":[{"id":721,"main":"Haze","description":"haze","icon":"50n"}],"base":"stations","main":{"temp":309.2,"feels_like":314.68,"temp_min":308.88,"temp_max":309.2,"pressure":1000,"humidity":46},"visibility":4000,"wind":{"speed":2.57,"deg":90},"clouds":{"all":40},"dt":1657299517,"sys":{"type":1,"id":9165,"country":"IN","sunrise":1657238362,"sunset":1657288352},"timezone":19800,"id":1273294,"name":"Delhi","cod":200}'

In [31]:
weather = json.loads(r.text)
weather

{'coord': {'lon': 77.2167, 'lat': 28.6667},
 'weather': [{'id': 721,
   'main': 'Haze',
   'description': 'haze',
   'icon': '50n'}],
 'base': 'stations',
 'main': {'temp': 309.2,
  'feels_like': 314.68,
  'temp_min': 308.88,
  'temp_max': 309.2,
  'pressure': 1000,
  'humidity': 46},
 'visibility': 4000,
 'wind': {'speed': 2.57, 'deg': 90},
 'clouds': {'all': 40},
 'dt': 1657299517,
 'sys': {'type': 1,
  'id': 9165,
  'country': 'IN',
  'sunrise': 1657238362,
  'sunset': 1657288352},
 'timezone': 19800,
 'id': 1273294,
 'name': 'Delhi',
 'cod': 200}
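
Once the response is a Python dictionary, individual fields can be read with ordinary indexing. The sketch below works on a trimmed copy of the response shown above, so it runs without calling the API. It assumes the temperature is in Kelvin, which is OpenWeatherMap's default unit when no `units` parameter is passed, and it uses the `timezone` offset (in seconds) to show the sunrise in local time.

```python
import json
from datetime import datetime, timedelta, timezone

# Trimmed copy of the response above, so no API call is needed.
raw = (
    '{"weather":[{"main":"Haze","description":"haze"}],'
    '"main":{"temp":309.2,"humidity":46},'
    '"sys":{"sunrise":1657238362},'
    '"timezone":19800,"name":"Delhi"}'
)
weather = json.loads(raw)

# Kelvin to Celsius (assumes the default unit system).
temp_celsius = round(weather["main"]["temp"] - 273.15, 2)

# "weather" is a list of conditions; take the first one.
description = weather["weather"][0]["description"]

# Unix timestamp plus the city's UTC offset gives local sunrise time.
local_tz = timezone(timedelta(seconds=weather["timezone"]))
sunrise_local = datetime.fromtimestamp(weather["sys"]["sunrise"], tz=local_tz)

print(f"{weather['name']}: {temp_celsius} °C, {description}")
print("Sunrise (local time):", sunrise_local.strftime("%H:%M"))
```
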

## Scraping information from webpages

In [32]:
!pip install beautifulsoup4



In [56]:
baseurl = "http://books.toscrape.com/"

r = requests.get(baseurl)

r.content



In [57]:
from bs4 import BeautifulSoup

In [58]:
soup = BeautifulSoup(r.content, features="html.parser")
soup

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="static

In [59]:
ul_list = soup.find('ul', class_="nav-list")


In [60]:
ul_li_items = ul_list.ul.find_all('li')
baseurl + ul_li_items[0].a['href']

'http://books.toscrape.com/catalogue/category/books/travel_2/index.html'

In [61]:
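def extract_categories(baseurl):
    """Return a dict mapping each category name to its page URL."""
    categories = {}
    r = requests.get(baseurl)
    soup = BeautifulSoup(r.content, features="html.parser")
    # the category links live in the nested <ul> inside <ul class="nav-list">
    categories_list = soup.find('ul', class_="nav-list").ul.find_all("li")
    for li in categories_list:
        categories.update({li.text.strip(): baseurl + li.a['href']})
    return categories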

In [62]:
extract_categories(baseurl)

{'Travel': 'http://books.toscrape.com/catalogue/category/books/travel_2/index.html',
 'Mystery': 'http://books.toscrape.com/catalogue/category/books/mystery_3/index.html',
 'Historical Fiction': 'http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html',
 'Sequential Art': 'http://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html',
 'Classics': 'http://books.toscrape.com/catalogue/category/books/classics_6/index.html',
 'Philosophy': 'http://books.toscrape.com/catalogue/category/books/philosophy_7/index.html',
 'Romance': 'http://books.toscrape.com/catalogue/category/books/romance_8/index.html',
 'Womens Fiction': 'http://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html',
 'Fiction': 'http://books.toscrape.com/catalogue/category/books/fiction_10/index.html',
 'Childrens': 'http://books.toscrape.com/catalogue/category/books/childrens_11/index.html',
 'Religion': 'http://books.toscrape.com/catalogue/category/books/rel

## Extracting all book info

In [63]:
url = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
res = requests.get(url)

soup = BeautifulSoup(res.content, features="html.parser")

In [64]:
soup

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    Mystery | 
     Books to Scrape - Sandbox

</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="
    
" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="../../../../static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="../../../../static/oscar/css/styles.css" rel="stylesheet" type

In [73]:
data = {
    'product_page_url': [],
    'title': [],
    'price_including_tax': [],
    'number_available': []
}

In [66]:
book_page_list = soup.find('ol', class_="row").find_all("li")
book_page_list

[<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
 <div class="image_container">
 <a href="../../../sharp-objects_997/index.html"><img alt="Sharp Objects" class="thumbnail" src="../../../../media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg"/></a>
 </div>
 <p class="star-rating Four">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="../../../sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a></h3>
 <div class="product_price">
 <p class="price_color">£47.82</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>
 </li>,
 <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
 <div class="image_container">
 <a href="../.

In [74]:
for li in book_page_list:
    # hrefs look like "../../../sharp-objects_997/index.html"; drop the "../" parts
    href_split = li.a['href'].split("/")[3:]
    book_page_link = "http://books.toscrape.com/catalogue/" + "/".join(href_split)

    print(book_page_link)

    # fetch and parse the individual book page
    req = requests.get(book_page_link)
    book_soup = BeautifulSoup(req.content, features="html.parser")

    data['product_page_url'].append(book_page_link)

    # the product information table has one <tr> per attribute
    rows = book_soup.find("table").find_all("tr")

    title = book_soup.find('title').text.strip()
    data['title'].append(title)

    # row 3 is "Price (incl. tax)", row 5 is "Availability"
    data['price_including_tax'].append(rows[3].find('td').text.strip())
    data['number_available'].append(rows[5].find('td').text.strip())
    


http://books.toscrape.com/catalogue/sharp-objects_997/index.html
http://books.toscrape.com/catalogue/in-a-dark-dark-wood_963/index.html
http://books.toscrape.com/catalogue/the-past-never-ends_942/index.html
http://books.toscrape.com/catalogue/a-murder-in-time_877/index.html
http://books.toscrape.com/catalogue/the-murder-of-roger-ackroyd-hercule-poirot-4_852/index.html
http://books.toscrape.com/catalogue/the-last-mile-amos-decker-2_754/index.html
http://books.toscrape.com/catalogue/that-darkness-gardiner-and-renner-1_743/index.html
http://books.toscrape.com/catalogue/tastes-like-fear-di-marnie-rome-3_742/index.html
http://books.toscrape.com/catalogue/a-time-of-torment-charlie-parker-14_657/index.html
http://books.toscrape.com/catalogue/a-study-in-scarlet-sherlock-holmes-1_656/index.html
http://books.toscrape.com/catalogue/poisonous-max-revere-novels-3_627/index.html
http://books.toscrape.com/catalogue/murder-at-the-42nd-street-library-raymond-ambler-1_624/index.html
http://books.toscrap
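
The loop above rebuilds each book URL by splitting the relative `href` by hand. As an alternative sketch, the standard library's `urljoin` resolves a relative link against the page it was found on, which avoids depending on how many `../` segments the link contains. The two URLs below are taken from this section.

```python
from urllib.parse import urljoin

# The page on which the link was found.
category_url = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"

# A relative href as it appears in the category page's HTML.
href = "../../../sharp-objects_997/index.html"

# urljoin walks up one directory per "../" and appends the rest.
book_page_link = urljoin(category_url, href)
print(book_page_link)
```
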

In [68]:
data

{'product_page_url': ['http://books.toscrape.com/catalogue/sharp-objects_997/index.html',
  'http://books.toscrape.com/catalogue/in-a-dark-dark-wood_963/index.html',
  'http://books.toscrape.com/catalogue/the-past-never-ends_942/index.html',
  'http://books.toscrape.com/catalogue/a-murder-in-time_877/index.html',
  'http://books.toscrape.com/catalogue/the-murder-of-roger-ackroyd-hercule-poirot-4_852/index.html',
  'http://books.toscrape.com/catalogue/the-last-mile-amos-decker-2_754/index.html',
  'http://books.toscrape.com/catalogue/that-darkness-gardiner-and-renner-1_743/index.html',
  'http://books.toscrape.com/catalogue/tastes-like-fear-di-marnie-rome-3_742/index.html',
  'http://books.toscrape.com/catalogue/a-time-of-torment-charlie-parker-14_657/index.html',
  'http://books.toscrape.com/catalogue/a-study-in-scarlet-sherlock-holmes-1_656/index.html',
  'http://books.toscrape.com/catalogue/poisonous-max-revere-novels-3_627/index.html',
  'http://books.toscrape.com/catalogue/murder-a

In [75]:
df = pd.DataFrame(data)
df

| | product_page_url | title | price_including_tax | number_available |
|---|---|---|---|---|
| 0 | http://books.toscrape.com/catalogue/sharp-obje... | Sharp Objects \| Books to Scrape - Sandbox | £47.82 | In stock (20 available) |
| 1 | http://books.toscrape.com/catalogue/in-a-dark-... | In a Dark, Dark Wood \| Books to Scrape - Sandbox | £19.63 | In stock (18 available) |
| 2 | http://books.toscrape.com/catalogue/the-past-n... | The Past Never Ends \| Books to Scrape - Sandbox | £56.50 | In stock (16 available) |
| 3 | http://books.toscrape.com/catalogue/a-murder-i... | A Murder in Time \| Books to Scrape - Sandbox | £16.64 | In stock (16 available) |
| 4 | http://books.toscrape.com/catalogue/the-murder... | The Murder of Roger Ackroyd (Hercule Poirot #4... | £44.10 | In stock (15 available) |
| 5 | http://books.toscrape.com/catalogue/the-last-m... | The Last Mile (Amos Decker #2) \| Books to Scra... | £54.21 | In stock (14 available) |
| 6 | http://books.toscrape.com/catalogue/that-darkn... | That Darkness (Gardiner and Renner #1) \| Books... | £13.92 | In stock (14 available) |
| 7 | http://books.toscrape.com/catalogue/tastes-lik... | Tastes Like Fear (DI Marnie Rome #3) \| Books t... | £10.69 | In stock (14 available) |
| 8 | http://books.toscrape.com/catalogue/a-time-of-... | A Time of Torment (Charlie Parker #14) \| Books... | £48.35 | In stock (14 available) |
| 9 | http://books.toscrape.com/catalogue/a-study-in... | A Study in Scarlet (Sherlock Holmes #1) \| Book... | £16.73 | In stock (14 available) |
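
The scraped columns are still raw strings: the title carries the site suffix, the price includes a currency symbol, and the stock count is buried in a sentence. A possible cleaning step is sketched below with three hypothetical helpers (`clean_title`, `parse_price`, `parse_availability`); the string formats they expect are the ones visible in the table above.

```python
import re


def clean_title(text, suffix=" | Books to Scrape - Sandbox"):
    """Strip the site name that the <title> tag appends to every book title."""
    return text[: -len(suffix)] if text.endswith(suffix) else text


def parse_price(text):
    """Convert a price string such as '£47.82' to a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())


def parse_availability(text):
    """Extract the count from 'In stock (20 available)'; 0 if no count is present."""
    match = re.search(r"\((\d+) available\)", text)
    return int(match.group(1)) if match else 0


print(clean_title("Sharp Objects | Books to Scrape - Sandbox"))
print(parse_price("£47.82"))
print(parse_availability("In stock (20 available)"))
```

With pandas, helpers like these could be applied column by column, for example `df['price_including_tax'].apply(parse_price)`.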


## Scrapy