# Lab 2.2<br>Data Acquisition

## BUS152 - Spring 2024 <br> Brian Brady

### __Objectives__

With this lesson we finally get a chance to pull in some real data!  There are lots of different ways you can connect to data, and which one you use will depend on whether you're in a production or development environment.  The reading material covers a wider range of access methods, but for now we'll cover a few of the most common ways to get your hands dirty.  Our objectives for this lesson are to explore two of them:

- Accessing Library Datasets
- Downloading CSV files

### __Accessing Library Datasets__

This is probably the simplest way to get data to work with!  There are several libraries that have zipped up data and made it available for us to download and use.  These are mainly datasets you can play with to learn, develop your model building skills, and test algorithm performance.  They are usually real-world data, but they probably won't be the data you need for any real project you want to work on.  

If you have never installed the library you want to work with, you will need to install it before importing it into your environment.  This is a one-time step through the Anaconda Prompt; next time you can go straight to the `import` command.  We would normally install through the Anaconda package management system (`conda`) if possible, but often a library we want to work with won't be available there, so in those cases we'll use the `pip` installer instead.  Try googling "conda install *library name*" and see if there's an option.  If not, go ahead and use `pip install *library name*` instead.
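If you're not sure whether a library is already installed, one way to check from Python is the standard library's `importlib.util.find_spec`.  This is only a minimal sketch, and the helper name `is_installed` is made up for illustration:

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Check the two libraries this lab imports
for name in ["pandas", "pydataset"]:
    if is_installed(name):
        print(f"{name}: installed")
    else:
        print(f"{name}: not found, install it from the Anaconda Prompt first")
```

Anything reported as not found needs the install step above before the `import` will work.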

After you've installed the library, go ahead and run the `import` commands below.

![cond_prompt](../images/anaconda_prompt.png)

In [None]:
# Import packages
from pydataset import data
import pandas as pd

Excellent.  Now you have the "data" function from the "pydataset" library loaded into your environment and available for use.

So where to start with this one?   Don't worry, it's easy enough.  We can call the `data()` function and see the names and ids of all of the sets available for our use.

In [None]:
# Turn off pandas truncation so we can see the entire list of datasets
pd.set_option("display.max_rows", None)
display(data())

Ok, great.  See one that sounds like it might be fun to play with?  I do, so let's print the details for the "income" set with the code below.

In [None]:
# Show the documentation for a given dataset
data("income", show_doc=True)

Awesome.  Sounds interesting.  Let's go ahead and read this one into memory and see what we've got!

In [None]:
# Reset the pandas truncation
pd.set_option("display.max_rows", 10)

# Read in the "income" dataset
dat = data('income')
dat

And there you go!  That's all there is to it for playing around with datasets from the `pydataset` library.  Go ahead and snoop around and see if any others look interesting.  There are plenty of other libraries providing datasets too, but this one has a decent-sized, comprehensive list to get you started.  If you're looking for more, feel free to check out the `ucimlrepo` or `scikit-learn` datasets, among others.
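As one illustration of those other sources, here is a minimal sketch of loading a bundled scikit-learn dataset into a pandas DataFrame.  It assumes scikit-learn is installed, and the iris dataset is just my pick for the example, not one this lab depends on:

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)

# .frame holds the feature columns plus the "target" column in one DataFrame
iris_df = iris.frame
print(iris_df.head())
print(iris_df.shape)
```

From here `iris_df` behaves like any other DataFrame, the same as `dat` above.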

### __Downloading CSV Files__

This is probably the most common way you'll get real data once you move beyond playing around with toy sets in packaged libraries.  These can be files on your local machine, or files hosted on shared sites like Google Drive or GitHub.

Let's start with a local file on your machine.  If I'm working in a local IDE like Spyder or VS Code, I normally set my "working directory" using the `os.chdir()` function from the `os` library.  This is just a convenience step so Python knows where home base is for the files you'll be importing and exporting.  Otherwise, you'll need to spell out the full path to this folder each time you read or write a file there.  

Here's an example of a working directory path for my machine.  You'll need to update the `os.chdir()` path to wherever you cloned your course files from the GitHub repository.

In [None]:
# Set your working directory
import os
import pandas as pd
#os.chdir('D:/Brian/Althoff/BUS152-2024-Spring/datasets/')
os.chdir('C:/your_path/BUS152-2024-Spring/datasets/')

In [None]:
# Read in the data using the pandas `pd.read_csv()` function
dat = pd.read_csv("Davis.csv")
dat
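If you'd rather not change the working directory, an alternative is to build the full path to the file and hand that to `pd.read_csv()`.  This is a minimal sketch using the standard library's `pathlib`.  The folder below is the same placeholder path as above, so swap in your own location:

```python
from pathlib import Path

import pandas as pd

# Placeholder folder: replace with wherever you cloned the course files
data_dir = Path("C:/your_path/BUS152-2024-Spring/datasets")

# The / operator joins path pieces
csv_path = data_dir / "Davis.csv"

if csv_path.exists():
    dat = pd.read_csv(csv_path)
    print(dat.head())
else:
    # A wrong folder is the usual cause of a "file not found" error
    print(f"Could not find {csv_path}; check the folder path.")
```

Checking `csv_path.exists()` first gives a friendlier message than the `FileNotFoundError` that `pd.read_csv()` raises for a missing file.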

The pandas documentation is fantastic, so don't forget to check <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html">Pandas Read CSV</a> if you run into any issues.
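A few `pd.read_csv()` parameters come up often when a file doesn't load cleanly on the first try.  The sketch below uses made-up CSV text held in memory rather than any course file, so the column names and values are purely illustrative:

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk (not course data)
csv_text = """id;height;weight
1;182;77
2;161;?
3;177;68
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",          # column delimiter used in the file (the default is a comma)
    index_col="id",   # use the "id" column as the row labels
    na_values=["?"],  # treat "?" as a missing value
)
print(df)
print(df.dtypes)
```

Because "?" is read as missing, the `weight` column is stored as floating point, while `height` stays an integer column.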

This works just as well for datasets hosted on cloud sharing sites like GitHub.  See below and give it a try.

In [None]:
# Read in Davis height/weight data from Github as a .csv file
url = 'https://github.com/bradybr/practical-data-science-and-ml/blob/main/datasets/Davis.csv?raw=true'
dat = pd.read_csv(url)
dat
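The `?raw=true` at the end of that URL matters: without it, GitHub serves the HTML page for the file rather than the CSV itself.  GitHub also serves raw file contents from `raw.githubusercontent.com`, and the sketch below converts one URL form to the other.  The helper name `github_blob_to_raw` is made up for illustration, and I'm assuming a public repository with the usual `/blob/<branch>/<path>` layout:

```python
def github_blob_to_raw(url):
    """Convert a github.com 'blob' page URL to its raw.githubusercontent.com form."""
    prefix = "https://github.com/"
    if not url.startswith(prefix) or "/blob/" not in url:
        raise ValueError(
            "Expected a URL like https://github.com/<user>/<repo>/blob/<branch>/<path>"
        )
    # Drop any query string such as ?raw=true
    path = url[len(prefix):].split("?", 1)[0]
    # Remove only the first '/blob/' segment, which sits between repo and branch
    path = path.replace("/blob/", "/", 1)
    return "https://raw.githubusercontent.com/" + path


url = 'https://github.com/bradybr/practical-data-science-and-ml/blob/main/datasets/Davis.csv?raw=true'
raw_url = github_blob_to_raw(url)
print(raw_url)
```

Either form can be passed to `pd.read_csv()`; the conversion is only useful when you've copied a link from the browser without `?raw=true`.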

We're only just barely dipping our toe in the water here since this is all we'll need for our course, but there are so many more ways to get data that we don't have time to cover.  See <a href="https://bradybr.github.io/practical-data-science-and-ml/Chapter5/data_acquisition.html">Chapter 5: Data Acquisition</a> for more detail.