# Intro to Data Acquisition

## Ways to Acquire Data:

### 1. Public Data

Check out the following websites for access to public data. 

* [GitHub](https://github.com/)
* [Kaggle](https://www.kaggle.com/)
* [KDnuggets](https://www.kdnuggets.com/)
* [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
* [US Government’s Open Data](https://www.data.gov/)
* [Five Thirty Eight](https://data.fivethirtyeight.com/)
* [Amazon Web Services](https://aws.amazon.com/)

Many of these datasets can be loaded straight from a URL with pandas:
```Python
# Import pandas with alias
import pandas as pd
 
# Assign the dataset url as a variable
url = "https://raw.githubusercontent.com/shrikant-temburwar/Iris-Dataset/master/Iris.csv"
 
# Define the column names of dataset as a list
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
 
# Use read_csv to read in data as a pandas dataframe
df = pd.read_csv(url, names=columns)
 
# Check head of dataframe
print(df.head())
```

### 2. Private Data

Businesses also curate private datasets that they own outright. For instance, Netflix’s database of user preferences powers its recommendation systems. Datasets can also be purchased through data marketplaces such as Data & Sons, and crowdsourcing marketplaces such as Amazon Mechanical Turk let you outsource data acquisition tasks ranging from data validation and research to survey participation. Private data is most often used in large, production-scale settings.

__Pros:__
* Time: Readily available datasets that can quickly move a project to the next phase of the Data Science Life Cycle.
* Cost: Public datasets can cut costs of collecting data down to zero.

__Cons:__
* Messy: Data can often come in forms that require intensive cleaning and modification.
* Cost: Private services can lead to high costs in acquiring data.


### 3. Web Scraping 

The most commonly used Python libraries for web scraping are BeautifulSoup, Selenium, and Scrapy. Web scraping is typically used to acquire data for small to medium-sized projects, but rarely in production, where it can raise ownership and copyright issues.

Example Code: 

```Python
# Import libraries 
import pandas as pd
from bs4 import BeautifulSoup
import requests
 
# Assign URL to variable
url = "https://www.codecademy.com/"
 
# Send request to download the data from URL
response = requests.request("GET", url)
 
# Create BeautifulSoup object
# Use HTML parser to parse the page's text
data = BeautifulSoup(response.text, 'html.parser')
 
# Print the first header of the page
print(data.html.h1)
 
# Instantiate list to append some content
content = []
 
# Use BeautifulSoup's find_all method to find all paragraph tags
words = data.find_all('p')
 
# Iterate through all paragraph tags
# append text to list with for loop
for word in words:
    content.append(word.text)
 
# Check content
print(content)
 
# Create dataframe of content with pandas DataFrame method
df = pd.DataFrame(content, columns= ['Text'])
 
# Check scraped dataframe
print(df)
```

__Pros:__
* Versatile: Highly adaptable method for acquiring data from the internet.
* Scalable: Distributed bots can be coordinated to retrieve large quantities of data.

__Cons:__
* Language Barrier: Scraping means working with web languages such as HTML, CSS, and JavaScript, which are not typically used for data science.
* Legality: Excessive or improper web scraping can be illegal, disrupt a website’s functionality, and lead to your IP address being blacklisted from the site.
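
One way to lower the risk of improper scraping is to check a site's `robots.txt` rules before requesting a page. The sketch below uses Python's standard `urllib.robotparser` module on a made-up `robots.txt`. The rules, the `example.com` URLs, and the `my-scraper` user agent are all assumptions for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real scraper would download the site's own file
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

# Parse the rules from the in-memory text instead of fetching them
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

user_agent = "my-scraper"  # hypothetical name for our bot

# Ask whether our bot may request each (hypothetical) page
allowed_page = parser.can_fetch(user_agent, "https://example.com/articles/data.html")
blocked_page = parser.can_fetch(user_agent, "https://example.com/private/data.html")

# Seconds the site asks bots to wait between requests
delay = parser.crawl_delay(user_agent)

print(allowed_page)  # True
print(blocked_page)  # False
print(delay)         # 2
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` fetches the real file, and pausing for the crawl delay between requests (for example with `time.sleep`) keeps the load on the server low.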


### 4. APIs

Unlike web scraping, which extracts data from pages built for human readers, an API is a means of communication between two software systems. Typically this communication takes the form of an HTTP request/response cycle in which a client (you) sends a request for data to a website’s server through an API call. The server then searches its databases for the requested data and responds to the client with either the data or an error stating that the request cannot be fulfilled.
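
Example Code:

The request/response cycle can be sketched end to end with a toy API. This is a minimal illustration that uses only the Python standard library so that it runs without internet access. The `FLOWERS` "database", the `/flowers/<name>` route, and the `get_flower` helper are all hypothetical and do not belong to any real service.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import ProxyHandler, build_opener

# Hypothetical "database" that the toy server searches
FLOWERS = {"setosa": {"petal_length": 1.4}, "virginica": {"petal_length": 5.5}}


class FlowerAPI(BaseHTTPRequestHandler):
    """Server side: look up the requested record and respond with JSON."""

    def do_GET(self):
        # Treat the last path segment as the name of the requested record
        name = self.path.rstrip("/").split("/")[-1]
        record = FLOWERS.get(name)
        status = 200 if record else 404
        body = json.dumps(record or {"error": f"{name} not found"}).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # Silence per-request logging


# Ignore any system proxy settings, since the server is local
opener = build_opener(ProxyHandler({}))


def get_flower(base_url, name):
    """Client side: send a GET request, return (status code, parsed JSON)."""
    try:
        with opener.open(f"{base_url}/flowers/{name}", timeout=5) as response:
            return response.status, json.load(response)
    except HTTPError as err:
        # The server answered, but with an error status
        return err.code, json.load(err)


# Start the toy server on a free local port in a background thread
server = HTTPServer(("127.0.0.1", 0), FlowerAPI)
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = f"http://127.0.0.1:{server.server_port}"

# One request the server can fulfill, and one it cannot
found = get_flower(base_url, "setosa")
missing = get_flower(base_url, "tulip")
print(found)    # (200, {'petal_length': 1.4})
print(missing)  # (404, {'error': 'tulip not found'})

# Stop the server and release the port
server.shutdown()
server.server_close()
```

Against a real API, only the client half is needed. It is commonly written with the `requests` library, for example `requests.get(url, params=...)` followed by `response.json()`, with the URL and parameters taken from the API's documentation.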

__Pros:__
* User & Site Friendly: APIs allow security and management of resources for sites that data is being requested from.
* Scalable: APIs can allow for various amounts of data to be requested, up to production-scale volumes.

__Cons:__
* Limited: Some functions or data may not be accessed via an API.
* Cost: Some API calls can be quite expensive, leading to limitations of certain functions and projects.

# Big Data

The data that large corporations maintain is often so large and complex that it is referred to as Big Data. Data like this cannot be stored on a single machine and must often be stored in the cloud, hosted on servers in data centers. The term Big Data refers not only to the sheer volume of the data, which can easily grow to the petabyte and exabyte levels, but also to its variety and velocity. These characteristics are often called the 3 Vs of Big Data:

* **V**ariety
* **V**olume
* **V**elocity

Google’s Chief Economist Hal Varian has listed the four key components of data acquisition in business as:

* The drive toward more and more data extraction and analysis.
* The development of new contractual forms using computer-monitoring and automation.
* The desire to personalize and customize the services offered to users of digital platforms.
* The use of the technological infrastructure to carry out continual experiments on its users and consumers.


# Creating Your Own Dataset

Check out [this link](https://towardsdatascience.com/how-to-build-your-own-dataset-for-data-science-projects-7f4ad0429de4) to learn how to create your own datasets.