# Data Acquisition

### Why is Machine Learning Difficult?

- The data required to train models is, most of the time, **not available**.
- In other cases, the data is in **a very raw format that requires a lot of cleaning and feature engineering**.

### What are some common ways to collect data?

1. Collection from a publicly available data repository (files). ✅
2. Web APIs (JSON)
3. Scraping a website (check for legality).
4. Databases (SQL / NoSQL)

# Problem Statement: How to get all tweets with a tag `omicron`?
Get all the tweets from Twitter that talk about `#omicron`.

`Transition question`: **Does Twitter offer some sort of service that allows us to query data in this manner?**

## WWW


The World Wide Web is about communication between web clients and web servers.

Clients are often browsers (Chrome, Edge, Safari), but they can be any type of program or device.

Servers are most often computers in the cloud.



![Screenshot%202022-06-15%20at%206.56.51%20PM.png](attachment:Screenshot%202022-06-15%20at%206.56.51%20PM.png)



## HTTP Request / Response
Communication between clients and servers is done by requests and responses:

1. A client (a browser) sends an HTTP request to the web.
2. A web server receives the request.
3. The server runs an application to process the request.
4. The server returns an HTTP response (output) to the browser.
5. The client (the browser) receives the response.
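
The five steps above can be reproduced on a single machine. The following is a minimal sketch that uses only Python's standard library: a tiny server bound to `127.0.0.1` plays the web server, and `http.client` plays the browser. `HelloHandler` and `run_cycle` are illustrative names chosen for this sketch, not something from a real service.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical server-side application: answers every GET with plain text."""

    def do_GET(self):
        # Step 3: the server runs some code to process the request.
        body = f"You asked for {self.path}".encode("utf-8")
        # Step 4: the server sends back a status line, headers, and a body.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def run_cycle(path):
    """Start a local server, send one GET request, return (status, text)."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Steps 1 and 2: the client sends a request, the server receives it.
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("GET", path)
        # Step 5: the client receives the response.
        response = conn.getresponse()
        status, text = response.status, response.read().decode("utf-8")
        conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return status, text


status, text = run_cycle("/index.html")
print(status, text)
```

A browser does the same thing at a larger scale: one such round trip per page, style sheet, image, and script.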


## The HTTP Request Circle

A typical HTTP request / response circle:

- The browser requests an HTML page. The server returns an HTML file.
- The browser requests a style sheet. The server returns a CSS file.
- The browser requests a JPG image. The server returns a JPG file.
- The browser requests JavaScript code. The server returns a JS file.
- The browser requests data. The server returns data (in XML or JSON).

## The GET Method
GET is used to request data from a specified resource.

Note that the query string (name/value pairs) is sent in the URL of a GET request:

`/test/demo_form.php?name1=value1&name2=value2`

**Some notes on GET requests:**

- GET requests can be cached
- GET requests remain in the browser history
- GET requests can be bookmarked
- GET requests should never be used when dealing with sensitive data
- GET requests have length restrictions
- GET requests are only used to request data (not modify)
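
To see how name/value pairs end up in the URL, here is a small sketch using the standard library's `urllib.parse`. It rebuilds the example URL shown above and then parses it back. The last line uses this lesson's `#omicron` search string as an assumed parameter value, to show why special characters have to be encoded.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

path = "/test/demo_form.php"
params = {"name1": "value1", "name2": "value2"}

# urlencode joins the pairs with "&" and percent-encodes unsafe characters.
url = f"{path}?{urlencode(params)}"
print(url)

# Going the other way: split the URL and recover the name/value pairs.
parsed = parse_qs(urlsplit(url).query)
print(parsed)

# "#" and spaces have a special meaning in URLs, so they must be encoded.
encoded_search = urlencode({"q": "#omicron -is:retweet"})
print(encoded_search)
```

Because the whole query string is part of the URL, it is visible in browser history and server logs, which is why GET is a poor fit for sensitive data.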


## The POST Method
POST is used to send data to a server to create/update a resource.

The data sent to the server with POST is stored in the request body of the HTTP request:

```http
POST /test/demo_form.php HTTP/1.1
Host: w3schools.com

name1=value1&name2=value2
```

**Some notes on POST requests:**

- POST requests are never cached
- POST requests do not remain in the browser history
- POST requests cannot be bookmarked
- POST requests have no restrictions on data length
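
The difference between the two methods can be inspected without sending anything over the network. The sketch below builds (but never sends) one POST and one GET request with `urllib.request.Request`, reusing the host and path from the example above. It is meant as an illustration of where the data travels, not as a recommended client.

```python
from urllib.parse import urlencode
from urllib.request import Request

form = {"name1": "value1", "name2": "value2"}
body = urlencode(form).encode("ascii")  # bytes that go in the request body

# Constructing a Request object does not open a connection.
post_req = Request(
    "https://w3schools.com/test/demo_form.php",
    data=body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)
get_req = Request("https://w3schools.com/test/demo_form.php?" + urlencode(form))

# POST: the data is in the body, the URL stays clean.
print(post_req.get_method(), post_req.full_url, post_req.data)
# GET: there is no body, the data is part of the URL.
print(get_req.get_method(), get_req.full_url, get_req.data)
```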

# What is an API?
An Application Programming Interface (API) is software that allows two applications to talk to each other and exchange data. Each time you check the weather on your phone or use a Google service, you're using an API.

**APIs are just like function calls, except that the functions live on a web server, and the API is the way to invoke them from your program.**
1. We send a request to the web server (to its API) to get the data.
2. If the call succeeds, the API returns the data, mostly in **`json` format**.
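
As a quick illustration of what "data in `json` format" means on the Python side, the sketch below parses a response body with the standard `json` module. The payload and its field names (`city`, `temperature_c`, `conditions`) are made up for this example; every real API defines its own structure.

```python
import json

# Hypothetical response body; real APIs define their own field names.
raw_body = '{"city": "Bengaluru", "temperature_c": 27.5, "conditions": ["cloudy", "humid"]}'

# json.loads turns the text into ordinary Python dicts, lists, and numbers.
payload = json.loads(raw_body)
print(type(payload).__name__)
print(payload["temperature_c"])
print(payload["conditions"][0])

# .get() avoids a KeyError when an optional field is missing.
wind = payload.get("wind_kph", "not reported")
print(wind)
```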

Some websites and APIs are open to all and provide free data, whereas most APIs are paid and require some form of authentication with API keys.

### Use cases of APIs:
- APIs can be used to call a function on a web server to perform a task and return the required response.
- APIs can also be used to get/send data over the internet.


Let's first look at a few free web APIs, and then we will explore a paid web API.



## How to make these API calls?

A number of things you should know when making a request on the web:
1. Protocol - HTTP
2. Authentication credentials for the API being called.
3. Functional requirements of the API (URL for the endpoint) - parameters, syntax, and sample response structure (is it JSON or something else) - using the language of your choice.
4. [Optional]: Are there 3rd party solutions available to make these requests easy? - e.g.: `yahoofinance`, `tweepy`
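
Points 1 to 3 can be put together in a small sketch. The endpoint `https://api.example.com/v1/search`, its parameter names, and the token value are placeholders invented for this illustration. A real API documents its own URL, parameters, and authentication scheme (a bearer token in an `Authorization` header is one common choice). The request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_api_request(endpoint, params, token):
    """Combine endpoint, query parameters, and credentials into one request."""
    url = f"{endpoint}?{urlencode(params)}"
    headers = {
        "Authorization": f"Bearer {token}",  # authentication credentials
        "Accept": "application/json",        # the response format we expect
    }
    return Request(url, headers=headers, method="GET")


# All three arguments are hypothetical placeholders.
req = build_api_request(
    "https://api.example.com/v1/search",
    {"query": "omicron", "max_results": 10},
    token="YOUR_TOKEN_HERE",
)

print(req.get_method())
print(req.full_url)
print(req.get_header("Authorization"))
```

Third-party wrappers such as `tweepy` do this assembly for you, so you only pass the parameters and the token.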



#### 1. How to send an HTTP request using Python?

**requests** package

In [1]:
# json is part of Python's standard library, so only requests needs installing
!pip install requests


In [2]:
import requests
import json

In [3]:
url = "https://api.ipify.org"

response = requests.get(url)

In [6]:
response.status_code

200

In [7]:
response.text

'122.161.86.73'
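
The `status_code` checked above is worth interpreting before trusting `response.text`. The helper below is a small sketch that maps a numeric code to its standard phrase with the built-in `http.HTTPStatus` enum. `describe_status` is a name made up for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a readable summary such as '404 Not Found (client error)'."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    if 200 <= code < 300:
        outcome = "success"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "other"
    return f"{code} {status.phrase} ({outcome})"


for code in (200, 401, 404, 429, 500):
    print(describe_status(code))
```

With `requests`, calling `response.raise_for_status()` is a shorter way to stop early: it raises an exception for 4xx and 5xx responses.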

## Twitter API

1. Get recent tweets from Scaler's account.
2. Get a list of all the people that a user is following.

In [8]:
!pip install tweepy



In [9]:
import tweepy

In [10]:
# Authenticate with a bearer token
client = tweepy.Client(bearer_token="<create-your-token-using-developer-portal>")




In [11]:
client.get_user(username="IAmClintMurphy")

Response(data=<User id=965699949512900608 name=Clint Murphy username=IAmClintMurphy>, includes={}, errors=[], meta={})

In [12]:
client.get_users_following(965699949512900608)

Response(data=[<User id=48139266 name=Phillip Rivers 🎱 username=thePhilRivers>, <User id=733866952045694977 name=Jillian Johnsrud username=JillianJohnsrud>, <User id=14879002 name=Jesse Pujji username=jspujji>, <User id=299754541 name=Gia Macool username=GiaMMacool>, <User id=99011479 name=Andrew Moses username=andrewhmoses>, <User id=1510578497068183558 name=Felix Hazlehurst username=stoic_wealth>, <User id=1137172434212331524 name=Mike Johnson username=MikeJohnson1_>, <User id=1328837956321374208 name=Jeremy Singh username=singhcredible>, <User id=929464603439923201 name=Brooks username=OfficialBBrooks>, <User id=1453368613 name=VIVN. username=VivanVatsa>, <User id=106575148 name=Jamie McLennan username=jamiemclennan29>, <User id=1470841921375543303 name=Chris Pronger username=chrispronger>, <User id=1273505391553449985 name=Aadit Sheth username=aaditsh>, <User id=2231021892 name=Tobi Emonts-Holley username=PeakTobi>, <User id=337470092 name=Robbie Crabtree username=RobbieCrab>, <Us

In [13]:
query = 'from:scaler_official -is:retweet'


tweets = client.search_recent_tweets(query = query, tweet_fields=['created_at', 'author_id'], max_results=100)

print(len(tweets.data))


14


In [18]:
for tweet in tweets.data:
    print(tweet.text)
    print(tweet.author_id)
    print("-------")

    

Meet the industry veterans https://t.co/RLvW6YUUaY
1194875574331699200
-------
Register Here: https://t.co/UeC0zeOyMk https://t.co/vRRM9w5IC9
1194875574331699200
-------
Hey Machan!
Are you a fresh graduate or a working professional? Do you feel lost and are in dire need of guidance on landing a top product-based tech company?⬇️

#CreateImpact #TamilNadu #VirtualEvent #Meetup
1194875574331699200
-------
Data visualization explained the right way. Let us know what other terms you’d like a simplification for? 

#datavisualization #dataanalytics #datareporting
1194875574331699200
-------
"I could learn a lot better when things are structured... Scaler just provided me that." Dhiraj felt that his current job didn't challenge him enough, which inspired him to upskill and scale his career.

#BehindTheWin #ScalerStories https://t.co/0occYiOCLI
1194875574331699200
-------
Register here - https://t.co/ajBUlUf5ho

PS.: We’ve got free pizza and coke. And a special bag of goodies to take home. htt

In [19]:
query = "#omicron -is:retweet"

data = {
    'text': [],
    'created_at': [],
    'author_id': []    
}


tweets = tweepy.Paginator(client.search_recent_tweets, query = query, 
                 tweet_fields = ['created_at', 'author_id'],
                 max_results = 100)

tweets

<tweepy.pagination.Paginator at 0x107814ac0>

In [29]:
for tweet in tweets.flatten(limit=2000):
    data['text'].append(tweet.text)
    data['created_at'].append(tweet.created_at)
    data['author_id'].append(tweet.author_id)
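
To build some intuition for what a paginator with `flatten(limit=...)` does, here is a simplified, self-contained sketch. It is not tweepy's actual implementation: `fake_search` is a hypothetical stand-in for a paged endpoint, and `flatten` is a hand-written generator that keeps requesting pages until the limit is reached or no next page exists.

```python
def fake_search(page_token=None, max_results=3):
    """Hypothetical paged endpoint: returns one page plus a token for the next."""
    items = [f"tweet {i}" for i in range(8)]
    start = int(page_token or 0)
    page = items[start:start + max_results]
    has_more = start + max_results < len(items)
    return {"data": page, "next_token": str(start + max_results) if has_more else None}


def flatten(fetch_page, limit):
    """Yield items one by one across pages, stopping after `limit` items."""
    token = None
    yielded = 0
    while yielded < limit:
        page = fetch_page(page_token=token)
        for item in page["data"]:
            if yielded >= limit:
                return
            yield item
            yielded += 1
        token = page["next_token"]
        if token is None:  # the server has no further pages
            return


first_five = list(flatten(fake_search, limit=5))
everything = list(flatten(fake_search, limit=100))
print(first_five)
print(len(everything))
```

Note that a limit larger than the available data simply returns fewer items, which is one reason a collected dataset can be smaller than the limit that was requested.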
    
    
    
    

In [25]:
import pandas as pd

In [26]:
df = pd.DataFrame(data)

In [28]:
df.shape

(1000, 3)

## Scraping a website

In [41]:
res = requests.get("http://quotes.toscrape.com/page/2/")
res

<Response [200]>

In [44]:
res.content

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThis life is what you make it. No matter what, you&#39;re going to mess up sometimes, it&#39;s a universal truth. But the 

In [43]:
from bs4 import BeautifulSoup


# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(res.content, "html.parser")
soup



<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters m

In [51]:
soup.find_all('div', class_ = "quote")[0].span.text

"“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly

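The lookup above picks the `<span class="text">` inside a quote `<div>`. To show the same idea without any third-party dependency, the sketch below swaps BeautifulSoup for the standard library's `html.parser` and collects the text of every such span. The HTML snippet is a hand-written stand-in modeled on the page structure shown earlier, and `QuoteTextParser` is a name invented for this example.

```python
from html.parser import HTMLParser


class QuoteTextParser(HTMLParser):
    """Collect the text inside every <span class="text"> element."""

    def __init__(self):
        super().__init__()
        self.quotes = []
        self._in_text_span = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "text" in classes:
            self._in_text_span = True
            self.quotes.append("")

    def handle_endtag(self, tag):
        # Good enough for this flat snippet; nested spans would need a depth counter.
        if tag == "span":
            self._in_text_span = False

    def handle_data(self, data):
        if self._in_text_span:
            self.quotes[-1] += data


# Hand-written sample, not the real page content.
sample_html = """
<div class="quote"><span class="text" itemprop="text">“First sample quote.”</span></div>
<div class="quote"><span class="text" itemprop="text">“It&#39;s a second sample.”</span></div>
"""

parser = QuoteTextParser()
parser.feed(sample_html)
parser.close()

for quote in parser.quotes:
    print(quote)
```

With BeautifulSoup, looping over `soup.find_all('div', class_="quote")` and reading `.span.text` from each element achieves the same result in far fewer lines.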
In [35]:
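def print_even(test_list):
    """Print and yield each even number in test_list."""
    for i in test_list:
        if i % 2 == 0:
            print(i)
            # `return i` would hand back a single int and end the function;
            # looping over that int raises
            # "TypeError: 'int' object is not iterable".
            # `yield` turns the function into a generator that can be iterated.
            yield i


test_list = [1, 3, 4, 5, 6, 7]

print("Original list", test_list)
print("==========")
for i in print_even(test_list):
    print(" ")



Original list [1, 3, 4, 5, 6, 7]
==========
4
 
6
 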

In [37]:
img_url = "https://static.toiimg.com/thumb/msid-60132235,imgsize-169468,width-800,height-600,resizemode-75/60132235.jpg"

res = requests.get(img_url)
img_data = res.content




In [38]:
with open("dog.jpg", 'wb') as f:
    f.write(img_data)
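
One caveat with saving `res.content` directly: if the server answers with an error page, the file on disk will contain HTML rather than an image. The sketch below shows one possible guard, checking the JPEG signature bytes before writing. The payloads are stand-ins so that the example runs offline, and `save_if_jpeg` is a hypothetical helper name.

```python
import os
import tempfile

JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG files start with these three bytes


def save_if_jpeg(data, path):
    """Write data to path only if it starts with the JPEG signature."""
    if not data.startswith(JPEG_MAGIC):
        return False
    with open(path, "wb") as f:
        f.write(data)
    return True


# Stand-in payloads (not real downloads) so the sketch runs without a network.
fake_jpeg = JPEG_MAGIC + b"\xe0" + b"\x00" * 16
html_error = b"<!DOCTYPE html><html>Not found</html>"

with tempfile.TemporaryDirectory() as tmp:
    ok_path = os.path.join(tmp, "dog.jpg")
    bad_path = os.path.join(tmp, "oops.jpg")
    saved_ok = save_if_jpeg(fake_jpeg, ok_path)
    saved_bad = save_if_jpeg(html_error, bad_path)
    ok_exists = os.path.exists(ok_path)
    bad_exists = os.path.exists(bad_path)

print(saved_ok, ok_exists)
print(saved_bad, bad_exists)
```

In the notebook, the same check would be applied to `img_data` before the `open("dog.jpg", 'wb')` step, ideally together with a look at `res.status_code`.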