We will start by importing the numpy and pandas modules, which we will need for
most of the notebooks. They should already be available in Colab.



In [None]:
import pandas as pd
import numpy as np



## Exploring the core-based statistical areas (CBSA) dataset

Now, as in the text, we want to load the `cbsa` dataset. You should have seen
the code to do this already, but we need to make an adjustment. In the text it
is assumed that the data lives in the same place as our code. Here, our code is
in Colab but the data is still over on GitHub. In order to load in the data, we
have to start our path with the right address for the book on GitHub. It will
be helpful to store this as a variable in Python, which we can do with the 
following code:



In [None]:
ubase = "https://raw.githubusercontent.com/distant-viewing/hdpy/refs/heads/main/"



All of the datasets in the book can be accessed by combining the url base above
with the code that you see in the book. In Python, we can combine two strings by
simply adding them together. So, here is the code to read the `cbsa` dataset.
Make sure you compare it to what is in the text of the book.



In [None]:
cbsa = pd.read_csv(ubase + "data/acs_cbsa.csv")
cbsa



A *method* is a function attached to an object. It is called by writing the
object name, a period, and then the name of the function. From there, it works
just like any other function. As shown in the text, use the `describe` method
below to see a description of the columns in the `cbsa` dataset.
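
As a minimal sketch of the calling syntax, the block below builds a small made-up DataFrame and calls a method on it. The name `toy` and its values are hypothetical and are not part of the book's data. It uses the `head` method, which returns the first rows of a DataFrame, so the exercise itself is left for you.

```python
import pandas as pd

# A tiny made-up table; the column names and values are for illustration only.
toy = pd.DataFrame({"name": ["A", "B", "C"], "pop": [10, 20, 30]})

# Object name, a period, the method name, then parentheses with any arguments.
first_two = toy.head(2)
first_two
```

The same pattern applies to `cbsa`: the object name goes before the period and the method name goes after it.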





And, now, use the `info` method to find basic information about each of the 
columns in the dataset.





There are many more methods for DataFrame objects than the two shown in the
text. Below, use the `nunique` method to see the number of unique values in
each column. Go through each and make sure that they seem reasonable to you.
Does anything seem surprising?





Some methods also take arguments. For example, the `sample` method takes an
argument called `n`. It will return a random set of `n` rows from the dataset.
Call this method below with a value of 10. Think about how random samples of the
rows might be helpful in a data analysis project. Note that you can use either
a positional argument or a named argument here (try to do both!).
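
To show the difference between the two styles without giving away the exercise, here is a sketch that uses the `tail` method, which also takes an argument called `n` and returns the last `n` rows. The DataFrame `toy` is made up for illustration.

```python
import pandas as pd

# A made-up table with 100 rows, for illustration only.
toy = pd.DataFrame({"value": range(100)})

# Positional argument: pandas matches the value to `n` by its position.
by_position = toy.tail(3)

# Named argument: the same call with the argument name written out.
by_name = toy.tail(n=3)

by_name
```

Both calls return the same three rows. Named arguments are longer to type but make it clearer what the value means.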





Sometimes methods can return new DataFrame objects. In the code below, try out
the `isnull` method. What happens?





When one method returns another object, we can *chain* together methods by adding
another method at the end of the result. Below, chain the `isnull` method with
the `sum` method. Do you understand what this is doing?
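
Chaining can be sketched with two other methods, `head` and `tail`, on a made-up DataFrame (the name `toy` and its values are hypothetical). The block does the same work twice: once in two steps with an intermediate variable, and once as a single chain.

```python
import pandas as pd

# A made-up table, for illustration only.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})

# Two steps, storing the intermediate DataFrame in a variable.
first_four = toy.head(4)
two_steps = first_four.tail(2)

# One chain: `tail` is called on whatever `head` returns.
chained = toy.head(4).tail(2)

chained
```

The chain reads left to right: take the first four rows, then take the last two of those.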





The `get` method is another common DataFrame method. It takes one argument: the
name of a column in the dataset. Use it below to get the 'division' column of
the data.





The value returned by the `get` method is a new kind of object called a *Series*.
It represents a single column of the data. Series objects have their own methods.
One particularly helpful method is `value_counts`. Chain it onto the `get` call
below to see the number of areas in each division of the country.
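
As a small sketch of what `get` returns, the block below uses a made-up DataFrame (the name `toy` and its columns are hypothetical). It checks the type of the result, chains the Series method `unique`, and shows that `get` returns `None` when asked for a column that does not exist.

```python
import pandas as pd

# A made-up table, for illustration only.
toy = pd.DataFrame({"fruit": ["apple", "pear", "apple"], "count": [3, 1, 2]})

# `get` returns a single column as a Series object.
fruit = toy.get("fruit")
kind = type(fruit).__name__

# Series methods can be chained directly onto the result of `get`.
distinct = toy.get("fruit").unique()

# Asking for a column that is not in the data returns None instead of an error.
missing = toy.get("no_such_column")

kind
```

If a later step fails with a message about `None`, a misspelled column name in `get` is a likely cause.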





We've already seen a lot of methods for DataFrame and Series objects, but there
are many more. You can find all of the options for both on the following help
pages:

- [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html)
- [Series](https://pandas.pydata.org/pandas-docs/stable/reference/series.html)

Take a moment to find a couple of the methods we have already seen and look at
what the help pages for each look like.

## Loading other datasets from the book

Now that you have an idea of how to load datasets from the book's GitHub
repository, modify the code from the text to load the food prices dataset in 
the chunk below:





And, similarly, load the Wikipedia UK authors dataset.





All three of these will be used throughout the course notes. If you have
questions about what they mean, now is a great time to ask!

## Loading data from Google Sheets

Only working with data that has already been put online is fairly limiting. We
will want to be able to get data into Python that we have created ourselves.
Let's see how to read data in from a public Google Sheets file.

Before we get there, it will be helpful to see a very nice feature of Python
called f-strings. If you put the letter "f" before a string, then anything in
the string inside of curly braces will be evaluated as Python code (most often
a variable name) and its value will be inserted into the string. That sounds
more complicated than it really is. Change the code below to have your name in
the variable called `my_name` and run it. Notice what happens with the string.
Feel free to experiment and see what else this can do.



In [None]:
my_name = "Taylor"
f"My name is {my_name}"
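
As one thing to experiment with, the curly braces can hold more than a single variable name. The sketch below uses two made-up variables, `width` and `height`, to show a calculation inside the braces and two sets of braces in one string.

```python
# Made-up values, for illustration only.
width = 3
height = 4

# The expression inside the braces is evaluated before it is inserted.
area_text = f"The area is {width * height}"

# A string can contain more than one set of braces.
both_text = f"{width} by {height}"

area_text
```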



Now, to read data in from Google Sheets we need to know the file ID and the
sheet ID. The file ID is in the URL of the sheet after the "/d/", and the sheet
ID is the value after `gid=`. Adjust the values below to match the sheet that
we have from class:



In [None]:
file_id = "1CVHofm5ukU3CRh2-jvJdu_t4jPXRPFryOTZIo-lWN6E"
sheet_id = "0"
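
If you would rather not copy the two IDs by hand, they can also be pulled out of the address with a regular expression. This is only a sketch: the value of `sheet_url` is an assumed example built from the file ID above with a typical `/edit#gid=0` ending, and the address in your own browser may look different.

```python
import re

# Assumed example of an address copied from the browser; yours will differ.
sheet_url = (
    "https://docs.google.com/spreadsheets/d/"
    "1CVHofm5ukU3CRh2-jvJdu_t4jPXRPFryOTZIo-lWN6E/edit#gid=0"
)

# The file ID is everything between "/d/" and the next slash.
file_match = re.search(r"/d/([^/]+)", sheet_url)

# The sheet ID is the run of digits after "gid=".
gid_match = re.search(r"gid=(\d+)", sheet_url)

parsed_file_id = file_match.group(1)
parsed_sheet_id = gid_match.group(1)

parsed_file_id, parsed_sheet_id
```

The results are stored under new names so that the `file_id` and `sheet_id` variables defined above are left unchanged.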



Now that you have those, we can use them to construct a path to the sheet that
returns a CSV file. Run the code below to see what this looks like:



In [None]:
url = f"https://docs.google.com/spreadsheets/d/{file_id}/export?format=csv&gid={sheet_id}"
url



Now that we have a URL, it is straightforward to download the dataset just as
we did with the book's datasets above. Do that below:



In [None]:
df = pd.read_csv(url)
df



Try out a few of the methods that we used above in the code block below. If
you want, you can add new code blocks so that each output is shown separately.





Getting data locally on your machine into Colab is a little more complex, but 
we will see approaches to that in future notebooks.
