# Week 4: Python Libraries

This week, we learned how to do basic operations with the numpy library, visualize data with Matplotlib, and gather data from the web using urllib and BeautifulSoup. Now let's practice these concepts to get a little more familiar with them.

## Part 1 : numpy

Let's use numpy to perform a few simple matrix operations using a matrix we will create from an example data set. For this assignment, we'll be using a collection of data about Near-Earth asteroids and comets, via the NASA website. For brevity, we'll usually just call these NEOs (near-Earth objects).

First, load the data from the provided JSON. Review week 2 and 3 homework assignments and lessons if you need help loading a JSON file.

In [2]:
# Here, we import the necessary libraries
import numpy
import json

# Remember to change the file location if needed
# Replace "[ ENTER CODE HERE ]" (including the brackets) with your solutions whenever it appears
path = "./datasets/near-earth-objects.json"
f = open(path)

For this assignment, we'll load the JSON objects a little differently than before. This time, we'll use the `for ___ in ___` syntax to add each line to the dataset. That way, if the length of the file changes but we still want to load the whole thing, we don't have to adjust how long the loop runs: it simply loads data until there's none left. Recall that we have imported the `json` library!
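
As a rough illustration of the line-by-line pattern, here is a minimal sketch that uses a small in-memory stand-in rather than the real file. The `fake_file` contents, the `designation` field, and the `records` name are made up for the example, and it assumes each line holds exactly one JSON object.

```python
import io
import json

# Hypothetical stand-in for the real file: two JSON objects, one per line
fake_file = io.StringIO(
    '{"designation": "Example A", "h_mag": "20.1"}\n'
    '{"designation": "Example B", "h_mag": "18.4"}\n'
)

records = []
for line in fake_file:
    # json.loads parses one JSON-formatted string into a Python dict
    records.append(json.loads(line))

print(len(records))         # 2
print(records[0]["h_mag"])  # 20.1
```

The loop stops on its own when the file runs out of lines, so the same code works for a file of any length.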

In [None]:
dataset = []

for asteroid in f:
    [ ENTER CODE HERE ]

# Let's print the first data item to see what our data looks like
dataset[0]

The names of the data fields have been shortened to save space. You don't need to know what each property means to complete this assignment, but with real data that knowledge often matters. For example, suppose you saw an `i_deg` value of `7000` in this dataset. If you didn't know what the field represented, you might not notice the problem: the angle between an NEO's orbit and Earth's orbit only goes up to 180 degrees, so a value of 7000 would point to a mistake in the data.

If you're curious, the data fields are: 

- `h_mag` - the absolute magnitude: how bright the asteroid would appear if it were 1 Astronomical Unit (AU) away from both the Earth and the Sun (1 AU = how far the Earth is from the Sun).

- `i_deg` - the inclination: the angle, in degrees, between the plane of the object's orbit and the plane of the Earth's orbit.

- `moid_au` - the Minimum Orbit Intersection Distance: the smallest distance between the orbit of the object and the orbit of the Earth (measured in AU).

- `orbit_class` - the classification of orbit the NEO has. The main 3 you will see in this data are "Atens", "Apollo", and "Amor", named after prominent asteroids that have similar orbits.

- `period_yr` - how long (in Earth years) the NEO takes to go around the Sun and make a full orbit.

- `pha` - short for "Potentially Hazardous Asteroid". A `Y` or `N` indicates whether astronomers believe that this NEO has the potential for a harmful impact. (Don't worry, a `Y` doesn't mean it *will* collide; it just means we have to keep an eye on this asteroid.)

- `q_au_1` - the perihelion distance: the shortest distance between the NEO's orbit and the Sun (measured in AU).

- `q_au_2` - the aphelion distance: the longest distance between the NEO's orbit and the Sun (measured in AU).

## Numpy arrays

Now we'll build numpy arrays from some of the properties of each near-Earth object so that we can run numpy operations on them. Let's grab the `h_mag`, `moid_au`, and `period_yr` properties for every asteroid and put each one in its own numpy array.
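
One common way to pull a single field out of a list of dictionaries is a list comprehension. The sketch below uses a few made-up records (the `sample` list and its values are hypothetical) and assumes the values may be stored as strings, which is why the conversion step is there.

```python
import numpy

# Hypothetical records shaped like the NEO data (values stored as strings)
sample = [
    {"h_mag": "20.1", "period_yr": "1.5"},
    {"h_mag": "18.4", "period_yr": "2.25"},
    {"h_mag": "22.0", "period_yr": "0.9"},
]

# A list comprehension pulls one field out of every record
sample_mags = [record["h_mag"] for record in sample]

# astype(float) converts the strings to floating-point numbers
sample_mags = numpy.array(sample_mags).astype(float)

print(sample_mags.dtype)  # float64
```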

In [None]:
# First, create a regular list of each of the 3 properties we want to collect:
magnitude = [ ENTER CODE HERE ]
MOID = [ ENTER CODE HERE ]
orbital_period = [ ENTER CODE HERE ]

# Now, convert them to numpy arrays:
# (numpy.float was removed in NumPy 1.24; the built-in float does the same job)
magnitude = numpy.array(magnitude).astype(float)
MOID = numpy.array(MOID).astype(float)
orbital_period = numpy.array(orbital_period).astype(float)

# Let's print out one of the arrays to check out the data:
magnitude

### Statistical Operations

Before we create and use matrices, let's process some of this data to gain some insight into our NEOs. Let's suppose we're interested in seeing the median magnitude, the standard deviation of all MOIDs, and the average orbital period. Luckily, numpy arrays support these operations natively. 
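
To show what these summary statistics look like on a small scale, here is a sketch on a toy array. The `values` array is made up so the results are easy to check by hand; note that `numpy.std` computes the population standard deviation by default.

```python
import numpy

# Toy data chosen so the results are easy to verify by hand
values = numpy.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(numpy.median(values))  # 4.5
print(numpy.mean(values))    # 5.0
print(numpy.std(values))     # 2.0
```

If you want the sample standard deviation instead, `numpy.std` accepts a `ddof=1` argument.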

In [None]:
# Review your lecture material to see how to do this - but you may also have to do some good old-fashioned Googling!
print("Median Magnitude: ", [ ENTER CODE HERE ])

In [None]:
print("Standard Deviation for MOIDS: ", [ ENTER CODE HERE ])

In [None]:
print("Average Time to Orbit Sun: ", [ ENTER CODE HERE ])

## Numpy Matrices

Arrays are very useful in general, but we often want to work with our data in matrix form. Let's convert our 3 arrays to a matrix and perform some common linear algebra operations on them.
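
As a small-scale sketch of the two steps that follow, the example below stacks two made-up arrays (`a` and `b` are placeholders, not the NEO data) into a two-dimensional array and then wraps it in the `matrix` type.

```python
import numpy

a = numpy.array([1.0, 2.0, 3.0])
b = numpy.array([4.0, 5.0, 6.0])

# Stacking two length-3 arrays gives a 2x3 two-dimensional array
stacked = numpy.array([a, b])
print(stacked.shape)  # (2, 3)

# numpy.matrix wraps the same values in the matrix type
as_matrix = numpy.matrix(stacked)
print(type(as_matrix).__name__)  # matrix
```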

In [None]:
# First, create an array of arrays named "NEO_data"

NEO_data = [ ENTER CODE HERE ]

NEO_data

Now convert it to the numpy `matrix` type:

In [None]:
NEO_data = [ ENTER CODE HERE ]
NEO_data

Now that we have a numpy matrix, try doing some matrix multiplication. The matrix times its transpose comes up frequently when working with matrices, so let's do that for practice. 
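
Here is a minimal sketch of that product on a made-up 2x3 matrix (`m` is a placeholder). Multiplying a 2x3 matrix by its 3x2 transpose gives a square 2x2 result.

```python
import numpy

m = numpy.matrix([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# For numpy.matrix objects, * is matrix multiplication and .T is the transpose
product = m * m.T

print(product.shape)  # (2, 2)
print(product)
```

With plain two-dimensional arrays instead of `matrix` objects, the equivalent expression would use the `@` operator, since `*` on arrays multiplies element by element.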

In [None]:
transpose_product = [ ENTER CODE HERE ]
transpose_product

Try getting the trace of the square matrix you just created. If you're unfamiliar with the trace of a matrix, it is the sum of the diagonal entries. While you *could* do this with a loop, numpy has a way to do this more easily.
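
For a quick check of the definition, this sketch computes the trace of a small made-up square array (`square` is a placeholder) and compares it with summing the diagonal directly.

```python
import numpy

square = numpy.array([[14.0, 32.0],
                      [32.0, 77.0]])

# numpy.trace sums the entries on the main diagonal
print(numpy.trace(square))      # 91.0

# The same value, computed by summing the diagonal explicitly
print(square.diagonal().sum())  # 91.0
```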

In [None]:
NEO_trace = [ ENTER CODE HERE ]
NEO_trace

Now that we're familiar with numpy and its basic operations, let's move on to visualizing the data. 

## Part 2 : Matplotlib

When working with a large amount of data, it can also be very helpful to visualize it. With numpy, we were able to manipulate the data and perform various operations on it. However, even though large matrices and arrays are easy for a computer to understand, it can be hard for us humans to take in and process that much data. So let's try using Matplotlib to create some visuals to help us see things a little more clearly.

First, we'll do a simple graph. We can use our NEO magnitude data for our Y-axis, but we'll need an X-axis to plot it against. So let's create one by filling an array with the numbers from 0 up to one less than the size of the magnitude array, so that every NEO gets its own position in our plots.

In [None]:
# First we have to import the Matplotlib library - feel free to change the name after the 'as'
import matplotlib.pyplot as matplt

# Here's how to create an empty numpy array with a specific size
xAxis = numpy.empty(magnitude.size)

for i in range(magnitude.size):
    xAxis[i] = i

xAxis
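
The loop above fills the array one element at a time. As an alternative sketch, `numpy.arange` builds the same sequence in a single call. The size `n` below is a made-up placeholder.

```python
import numpy

n = 5  # hypothetical size; in the assignment this would be magnitude.size

# Loop version, as in the cell above
looped = numpy.empty(n)
for i in range(n):
    looped[i] = i

# numpy.arange produces 0, 1, ..., n-1 directly
direct = numpy.arange(n)

print(numpy.array_equal(looped, direct))  # True
```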

Now that we have an X-axis, let's make a plot of the magnitudes. Use whatever alias you chose for `matplotlib.pyplot` (i.e. from `import library as alias`) to call its functions.
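
For reference, here is a minimal line-plot sketch on made-up numbers. The `x` and `y` arrays are placeholders, and the `matplt` alias simply matches the import used in this notebook.

```python
import numpy
import matplotlib.pyplot as matplt

# Hypothetical data standing in for the x-axis and the magnitudes
x = numpy.arange(5)
y = numpy.array([20.1, 18.4, 22.0, 19.3, 21.7])

# plot(x, y) draws a line through the points in the order given
lines = matplt.plot(x, y)
matplt.show()
```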

In [None]:
# Remember that your x-axis should be the variable named "xAxis" and your y-axis should be the magnitude values
[ ENTER CODE HERE ]

This allows us to more easily visualize our data. But it's still a bit hectic, so let's sort the data with numpy and then create a bar plot. 
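
The sketch below shows the general idea on made-up numbers: `numpy.sort` returns a sorted copy, and `bar` draws one bar per value. The `y` array and the other variable names are placeholders.

```python
import numpy
import matplotlib.pyplot as matplt

y = numpy.array([20.1, 18.4, 22.0, 19.3, 21.7])

# numpy.sort returns a sorted copy and leaves the original untouched
y_sorted = numpy.sort(y)

# bar(x, height) draws one bar per value
bars = matplt.bar(numpy.arange(y_sorted.size), y_sorted)
matplt.show()
```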

In [None]:
# Create a sorted copy of the magnitude data
sortedMag = [ ENTER CODE HERE ]

# Create a bar plot
[ ENTER CODE HERE ]

Notice that all of the bars reach high up in the graph, so the differences between them are hard to see. The bars don't need to start at 0, since the graph shows that every NEO has a magnitude between about 14 and 25. So let's limit our Y-axis to those values. While we're at it, let's title our graph and label our Y-axis so that scientists looking at our nice plot can get some idea of what they're looking at.
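
As an illustration of these calls on made-up data, the sketch below sets the y-range, label, and title before drawing the bars. The label and title strings, the `y_sorted` values, and the `ax` handle are all placeholders for the example.

```python
import numpy
import matplotlib.pyplot as matplt

y_sorted = numpy.array([18.4, 19.3, 20.1, 21.7, 22.0])

matplt.ylim(14, 25)                      # restrict the visible y-range
matplt.ylabel("Absolute magnitude (H)")  # label text is illustrative
matplt.title("Example NEO magnitudes")   # title text is illustrative
matplt.bar(numpy.arange(y_sorted.size), y_sorted)

ax = matplt.gca()  # keep a handle on the axes in case we want to inspect them
matplt.show()
```

Setting the limits explicitly turns off automatic scaling for that axis, so the bars drawn afterwards do not reset the range.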

In [None]:
# First limit the axes
[ ENTER CODE HERE ]

# Then label the y-axis
[ ENTER CODE HERE ]

# Then title the graph
[ ENTER CODE HERE ]

# Now plot!
[ ENTER CODE HERE ]

## Part 3 : urllib and BeautifulSoup

Now that we have some experience processing data with Python, let's practice collecting data. As you saw in the course slides, we often collect data by going through the `HTML` of a website and taking only the parts we need. As you can imagine, this gets tedious once you do it for more than a few web pages. So let's use the magic of Python, urllib, and BeautifulSoup to do all the legwork for us.

Keeping with the space theme, let's gather reviews for the video game "Kerbal Space Program", from the review aggregator website Metacritic. This will be very similar to what you saw in the lecture slides, but we will have to make some modifications to gather the data from Metacritic since it has a different structure. 

In [None]:
# We will have to get the html a little differently from your course slides 
# since Metacritic requires a specified user agent
from urllib.request import Request, urlopen

# Similar to your lecture slides, we will request the html from the site, but we also pass the
# `headers` argument to set the user agent
req = Request("https://www.metacritic.com/game/pc/kerbal-space-program/user-reviews", headers={'User-Agent': 'Mozilla/5.0'})

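# We read the response like any other file, then decode the raw bytes into a string
# (calling str() on the bytes would keep the b'...' wrapper and escaped newlines)
html = urlopen(req).read().decode("utf-8")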

# Uncomment the next line if you want to see all the html we grab (it's a little hard to read unformatted)
# html

We can see that reviews are contained in blocks that begin with `<div class="review_grade">`, so we'll use the `split` method to look for those blocks. 
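
To see how `split` behaves with a marker like this, here is a sketch on a heavily simplified, made-up HTML string. The `sample_html` contents, including the `score` class, are hypothetical and not the real page.

```python
# Hypothetical, heavily simplified stand-in for the downloaded page
sample_html = (
    '<html><body>'
    '<div class="review_grade"><div class="score">9</div></div>'
    '<div class="review_grade"><div class="score">7</div></div>'
    '</body></html>'
)

chunks = sample_html.split('<div class="review_grade">')

# The first chunk is everything before the first marker, not a review
print(len(chunks))  # 3
print(chunks[0])    # <html><body>
print(chunks[1])    # <div class="score">9</div></div>
```

Because the text before the first marker comes back as the first element, two markers produce three chunks.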

In [None]:
# Follow your slides for how to do this portion of the code
reviews = [ ENTER CODE HERE ]
len(reviews)

Now let's gather the rating that each of these users gave the game (a scale from 1 to 10). We will do this in much the same way as in the slides you were given.

In [None]:
def parseReview(review):
    d = {}
    # To get the rating, we have to be careful.
    # Not every rating has the same html classes right before the rating, but it does have the same html right after (a closing </div>).
    # To get around this, we use split to just grab everything between the two tags, i.e. everything between > and </div>.
    d['rating'] = review.split('>')[1].split('</div')[0]
    return d

reviewDict = [parseReview(r) for r in reviews]
reviewDict

And that's it! We've successfully extracted our own dataset from these user reviews! You are encouraged to modify the code to collect more fields, such as the display name of each reviewer or the text of their review. For help with that, follow the lecture slides and use your knowledge of Python to modify the `parseReview()` function. You should also combine what you learned about urllib with the other parts of this homework. For example, try using matplotlib to visualize the ratings we collected.
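
As one possible starting point for that last suggestion, the sketch below tallies ratings from a made-up list shaped like `reviewDict`. The `sample_reviews` data is hypothetical, and the sketch assumes every rating string is a whole number.

```python
from collections import Counter

# Hypothetical output in the same shape as reviewDict
sample_reviews = [{"rating": "9"}, {"rating": "7"}, {"rating": "9"}]

# The scraped ratings are strings, so convert them before doing arithmetic
scores = [int(r["rating"]) for r in sample_reviews]

# Counter maps each distinct score to how many times it appears
counts = Counter(scores)
print(counts[9])  # 2
print(counts[7])  # 1
```

The keys and values of `counts` could then be passed to a bar plot, with the scores on the x-axis and the number of reviews on the y-axis.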