# Assignment \#02

For this exercise, we're going to be using the college scorecard data. [The website for the data is here.](https://collegescorecard.ed.gov/data/)

The database documentation can be found in this [file.](https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx) You'll need to refer to it throughout the assignment.

**Scenario:** We have a big data problem where our data are spread across many different files. Although these files are not _that_ big, let's pretend that if they were one big file then that file wouldn't fit in memory (RAM). If we process each file individually, we can store in memory only the data we want to further examine. We want to extract some data from each file, do a calculation, and save only the results we are interested in.
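The pattern described in the scenario can be sketched as follows. This is only an illustration under stated assumptions: the file names (`toy_2013.csv`, ...) and the columns `ID` and `COST` are made up, and they are not the real scorecard files or schema. Each file is read on its own, a single summary value is kept, and the full table is released before the next file is read.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build three tiny stand-in CSV files (hypothetical names and columns).
tmp_dir = Path(tempfile.mkdtemp())
for year, costs in [(2013, [10, 20]), (2014, [12, 22]), (2015, [14, 24])]:
    toy_file = tmp_dir / f"toy_{year}.csv"
    pd.DataFrame({"ID": [1, 2], "COST": costs}).to_csv(toy_file, index=False)

results = {}
for path in sorted(tmp_dir.glob("toy_*.csv")):
    frame = pd.read_csv(path)              # only one file is loaded at a time
    year = int(path.stem.split("_")[1])    # "toy_2013" -> 2013
    results[year] = frame["COST"].mean()   # keep only the summary we care about
    del frame                              # drop the reference to the full table

print(results)
```

Only the small `results` dictionary survives the loop, so memory use stays close to the size of the largest single file.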

**Your Goal** is to cycle through data for public and private non-profit institutions located in Raleigh, North Carolina for all years of the college scorecard to calculate the cost of attendance through time. These data are split over individual CSV files for each school year. 

**Outcomes** 
1. Slice data
1. Index data
1. Describe data

**Credit Blocks**
1. Section 0 *memory storage and deletion in python*
1. Section 1 *reading in data*
1. Section 2 *slicing*

**Data Set**
The data set is located in the shared folder explored during class at `/shared/dsc495/CollegeData`

***
## **Section 0**

Efficient use of memory is key to processing large quantities of data. Determining the memory used by a process is not straightforward in Python, but we can determine where objects are stored and how much memory they take up. 
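As a minimal sketch of those two ideas, the snippet below inspects a plain Python list rather than the array used in the exercise. The variable names are hypothetical. Note that `id` returning the memory address is a CPython implementation detail, and that `sys.getsizeof` reports only the memory attributed directly to the object itself.

```python
import sys

values = [1.0, 2.0, 3.0]   # hypothetical example object

address = id(values)       # unique integer identifier (the address in CPython)
print(address)
print(hex(address))        # the same number written in hexadecimal

size_bytes = sys.getsizeof(values)   # size of the list object itself, in bytes
print(size_bytes)
```

For a list, the reported size covers the list's own bookkeeping and its references, not the float objects it points to.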

1. In the first box, create a [numpy array](https://numpy.org/doc/stable/reference/generated/numpy.array.html) named `a` of length `4` of [floating point numbers](https://docs.python.org/3/tutorial/floatingpoint.html) with type [>16](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html). 
2. Use two print statements to show the class and data type of the array. 

In [1]:
import numpy as np

3. Find the location that array `a` is stored at. Print both the [unique integer](https://docs.python.org/3/library/functions.html#id) and [hexadecimal](https://docs.python.org/3/library/functions.html#hex) representation. 

4. Create a new variable `b` from `a` that is equal to `a`. 
5. Print the locations of `b` as done in part 3.
6. Add 1 to all elements in `b`. 
7. Print the locations of `b` as done in part 3.

8. In the below markdown chunk, describe what happened to the object b after you added 1 to it:

9. Find the size of the array `b` in bytes using the function `getsizeof` from the `sys` [library](https://docs.python.org/3/library/sys.html#getsizeof). 
10. [Delete](https://docs.python.org/3/tutorial/datastructures.html#the-del-statement) `a` and `b` from memory. Show they are gone by triggering a `NameError`. 

In [2]:
import sys

***
***
## **Section 1**

Reading in data

1. Using pandas, read in the entirety of `MERGED2015_16_PP.csv`. Assign it to the variable `df`. Use the `low_memory=False` option.
2. Print the first 7 rows using the `.head()` method. 

In [3]:
import pandas as pd

3. Take a basic look at the size of the data frame in memory. Use the `getsizeof` function from Section 0 and print the resulting size in MB. 

4. Find the number of rows and columns in the [data frame.](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html)
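The steps in this section can be tried first on a small stand-in table. The sketch below assumes a made-up three-row CSV held in a string (the columns `ID`, `NAME`, and `COST` are hypothetical), so it runs without access to the shared folder. It uses 1 MB = 1024**2 bytes; dividing by 10**6 is an equally common convention.

```python
import io
import sys

import pandas as pd

# A tiny in-memory CSV standing in for a real file (hypothetical columns).
csv_text = "ID,NAME,COST\n1,Alpha,10.5\n2,Beta,20.0\n3,Gamma,\n"
toy = pd.read_csv(io.StringIO(csv_text), low_memory=False)

print(toy.head(2))            # first two rows only

n_rows, n_cols = toy.shape    # shape is a (rows, columns) tuple
print(n_rows, n_cols)

size_mb = sys.getsizeof(toy) / 1024**2   # bytes -> MB
print(f"{size_mb:.6f} MB")
```

The empty `COST` field in the last row is read as a missing value (`NaN`), which is worth keeping in mind when summarizing real columns.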

***
***
## **Section 2**

Slicing

If we don't tell pandas which column should act as the index for each row, it will assign a default integer index starting at 0. To ensure that our results are consistent across each file (and assuming the values are correctly entered), we can set a column that contains a unique identifier as the index of our data frame. Then in future operations, we can use that index to merge our results.

**NOTE:** You can do this step during data import using the argument `index_col="column_name"` in the `read_csv` function. Do not go back to that step. 

**NOTE**: _This is an extremely important step_, especially when you'll be working with joins/merges. Never assume your index is correct!

**NOTE:** Use the `inplace` argument so that the index is set directly on our existing data frame. The alternative, `df = df.set_index("UNITID")`, accomplishes the same task. The choice between the two depends on your desired results and coding style. Using `inplace` assumes the reader of your code understands what that argument will do, while assigning to a new or existing variable makes clear what's happening.
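The difference between the two styles can be seen on a toy data frame. This is a sketch with hypothetical column names (`ID`, `STATE`), not the scorecard data.

```python
import pandas as pd

toy = pd.DataFrame({"ID": [101, 102, 103], "STATE": ["AL", "AK", "AZ"]})

# Style 1: set_index returns a new data frame; `toy` itself is unchanged here.
indexed = toy.set_index("ID")

# Style 2: inplace=True modifies `toy` directly and returns None.
toy.set_index("ID", inplace=True)

print(indexed.index.equals(toy.index))   # both now share the same index
print(toy.index.is_unique)               # worth checking before any merge
```

Checking `is_unique` is a cheap way to act on the warning above about never assuming the index is correct.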

5. Set `UNITID` to the index using the inplace option to avoid assignment. 

Now we will begin slicing the data. 

Slicing the data allows us to look at a subset of the whole. Here, state abbreviation is the column `STABBR`. To access a specific column of a data frame `df` in Python, we can use `df["COLUMNNAME"]`. 
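A short sketch of column access on a toy data frame follows. The column names `NAME` and `STATE` and the index name `ID` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"NAME": ["Alpha", "Beta", "Gamma"], "STATE": ["AL", "AK", "AZ"]},
    index=pd.Index([101, 102, 103], name="ID"),
)

states = toy["STATE"]              # one column name -> a Series
first_two = toy["STATE"].head(2)   # first two values of that column
subset = toy[["NAME", "STATE"]]    # a list of names -> a DataFrame

print(first_two)
```

A single name returns a one-dimensional `Series`, while a list of names returns a `DataFrame`, even if the list holds only one name.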

6. Get the first five state abbreviations in the data set. Hint: The data is nearly in alphabetical order. 

To get a subset based on a criterion, use boolean expressions. [Boolean expressions](https://docs.python.org/3/library/stdtypes.html#boolean-operations-and-or-not), such as comparisons, evaluate to `True` or `False`. 

A data frame indexed by a list or Series of boolean values returns only the rows whose corresponding value is `True`. E.g. if `df` has 3 rows, then 
`df[ [True, False, True] ]` returns the 0th and 2nd rows of the data frame. 
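The same idea on a three-row toy data frame (hypothetical columns `STATE` and `SIZE`): a comparison on a column produces one `True`/`False` value per row, and that result can be used as the index directly.

```python
import pandas as pd

toy = pd.DataFrame({"STATE": ["NC", "VA", "NC"], "SIZE": [5, 7, 9]})

by_list = toy[[True, False, True]]   # keeps rows 0 and 2

mask = toy["STATE"] == "NC"          # Series of True/False, one per row
by_mask = toy[mask]                  # same rows as by_list

print(mask.sum())                    # True counts as 1, so this counts matches
```

Because `True` sums as 1, `mask.sum()` and `len(by_mask)` are two ways to count the matching rows.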

7. How many institutions are located in North Carolina?

To write more complex filters, we wrap each conditional in parentheses and use the appropriate logical operator (ampersand `&` for and, pipe `|` for or, and tilde `~` for negation) to filter by multiple criteria.
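A minimal sketch of combining conditions, using made-up data: the columns `CITY`, `STATE`, and `KIND`, and the category codes in `KIND`, are hypothetical and do not correspond to the scorecard columns. The parentheses matter because `&` and `|` bind more tightly than `==` in Python.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "CITY": ["Springfield", "Springfield", "Shelbyville", "Springfield"],
        "STATE": ["IL", "MO", "IL", "IL"],
        "KIND": [1, 2, 3, 3],   # made-up category codes
    }
)

in_city = toy["CITY"] == "Springfield"
in_state = toy["STATE"] == "IL"
kind_ok = (toy["KIND"] == 1) | (toy["KIND"] == 2)   # either code is accepted

selected = toy[in_city & in_state & kind_ok]        # all three must hold
rejected = toy[~(in_city & in_state & kind_ok)]     # everything else

print(len(selected), len(rejected))
```

Naming each condition before combining them keeps long filters readable and makes it easy to check each piece on its own.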

8. Write a filter that returns results for institutions that are:
*   in city of Raleigh
*   in North Carolina
*   Public or private non-profit

You may need to refer to the data documentation to find the correct columns for these filters. 

Save this result to the new variable `df_nc`. 

**Hint:** a generic example looks like: `df[(df["col_1"] == 123) & (df["col_2"] == 456)]`

9. Print the number of universities that satisfy these criteria. 

Great! We have now calculated results from the first data file. _But_ we need to repeat this process for each data file in our database. In our next exercise, we will look more at our approach to getting through multiple files and aggregating our results. Let's think ahead to what we need to do next.

We'll continue this exercise in a future assignment.