![alt text](https://github.com/callysto/callysto-sample-notebooks/blob/master/notebooks/images/Callysto_Notebook-Banner_Top_06.06.18.jpg?raw=true)

# Working With Open Data Part 5: Meteorite Landings and Falls Part 2

Now that we've learned how to use Jupyter notebooks to create maps from geospatial data, and how to filter that data to create interactive widgets, let's dive into some data analysis. This is a rather rich data set, and we may be able to draw some interesting conclusions from it. To begin, we first have to gather our libraries and data set in this notebook.

In [None]:
'''
This is exactly what we did in the previous notebook: just getting the data again
'''

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# numerical python library
import numpy as np
url = 'https://github.com/fleiser/Meteorite-landings/raw/master/Meteorite_Landings.csv'
landings = pd.read_csv(url)




# Deeper Data Analysis

This data set contains more than just the geolocation of meteorite falls: it also records whether each meteorite was found, along with properties such as the meteorite type and the mass (measured or projected). It is therefore worth exploring these results a little further to see whether there are any interesting trends hiding in the data, waiting for us to discover them.


First things first, however: let's calculate the percentage of meteorites that are found versus the total number that fell.



In [None]:
# Count the rows in each category of the 'fall' column
n_fell = len(landings[landings['fall'] == "Fell"])
n_found = len(landings[landings['fall'] == "Found"])
print("Percentage of fallen meteorites found:", n_found / (n_fell + n_found) * 100, '%')
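
As a cross-check on the calculation above, pandas can produce the same kind of percentage in one step with `value_counts(normalize=True)`. The sketch below uses a tiny made-up data frame (the name `toy` and its four rows are assumptions for illustration, not values from the real data set), so it runs on its own.

```python
import pandas as pd

# Hypothetical miniature stand-in for the 'fall' column of `landings`
toy = pd.DataFrame({"fall": ["Found", "Found", "Found", "Fell"]})

# normalize=True returns fractions that sum to 1;
# multiplying by 100 turns them into percentages
percentages = toy["fall"].value_counts(normalize=True) * 100
print(percentages)
```

Replacing `toy` with `landings` should give the share of each category in the real data.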

Interestingly, it appears that the majority of meteorites that fall are actually found. That raises a question, however: do the meteorites left undiscovered have any distinguishing properties? Certainly there are external factors, such as the geography or isolation of where the meteorite fell, that make it difficult to find, but perhaps there are also internal factors that make some meteorites harder to find. As we have several properties of each meteorite, such as mass and type, this seems like the perfect opportunity to create some visualizations and try to deduce why we might not find a meteorite.

## Histograms

In this case, a potential quantity of interest is the mass of the meteorites that fall. Perhaps there is some relationship between how massive a meteorite is and how likely it is to be found? To find out, let's create a histogram of the masses of both "Found" and "Fell" meteorites.

In [None]:
# First thing we need to do is import our "numerical python" library as we've done before
import numpy as np

# Now, let's just grab a single filtered column for each the fallen and the found meteorites

# This filters down our data frame to just rows where 'fall' is "Fell" or "Found", and then
# by typing ['mass (g)'] afterwards, we're only grabbing the mass column and assigning
# it to a new variable.
mass_fell = landings[landings['fall'] == "Fell"]['mass (g)']
mass_found = landings[landings['fall'] == "Found"]['mass (g)']

# Here we're dropping any NaN values (which we've seen before) from our columns to prevent
# any errors when plotting. To see the error that will show, simply remove the .dropna()

mass_fell = mass_fell.dropna()
mass_found = mass_found.dropna()



# Make a list of data to plot
plot_data = [mass_found, mass_fell]

'''
Here we create a histogram. 

bins  : This key word specifies how many bins to put the data in for the histogram

label : This is to specify the labels for each of the bars in the histogram.  
'''
# This is another way of setting the figure size. 
plt.figure(figsize=(12,8))

plt.hist(plot_data, bins = 50, label = ["Found", "Fell"])

plt.xlabel("Mass (g)", size = 16)
plt.ylabel("Counts", size = 16)
# Uncomment the line below to see a few more data points. 
# plt.ylim([0,10])
plt.legend()
plt.show()

Well, that's certainly a peculiar histogram that doesn't really tell us too much. Unfortunately, this is a consequence of having a large spread in the data: we have some incredibly massive meteorites, but we have a great many more very small ones. Because of this spread in mass, it becomes difficult to bin the values for the histogram. However, that's something we can absolutely deal with! As with (almost) any time you're dealing with data whose range is too large to bin effectively, we shall take the logarithm of the masses in order to "squish" our data into a more appropriate range.
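
To see why the logarithm helps with binning, here is a minimal sketch using made-up masses (the values in `masses` are assumptions chosen to span several orders of magnitude, not taken from the data set). It counts how the same values fall into five equal-width bins before and after taking `log10`.

```python
import numpy as np

# Hypothetical masses in grams spanning seven orders of magnitude
masses = np.array([1, 3, 10, 30, 100, 1_000, 10_000, 100_000, 10_000_000])

# Equal-width bins on the raw scale: the single huge value stretches
# the range, so nearly everything piles into the first bin
linear_counts, _ = np.histogram(masses, bins=5)

# Equal-width bins on the log10 scale: the values spread across the bins
log_counts, _ = np.histogram(np.log10(masses), bins=5)

print("linear bins:", linear_counts)
print("log10 bins: ", log_counts)
```

On the raw scale, one very large meteorite dictates the bin width for everything else; on the log scale, each bin covers a fixed multiplicative range of masses instead.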

In [None]:
# Note log10 is log base ten. np.log is the natural logarithm.
# The other logarithm included in numpy is log2 for log base two. Any other logarithms
# (in the event that you need them) will have to be calculated using properties of logarithms.

mass_fell_log = np.log10(landings[landings['fall'] == "Fell"]['mass (g)'].dropna())
mass_found_log = np.log10(landings[landings['fall'] == "Found"]['mass (g)'].dropna())



Uh oh! We got a runtime warning: specifically, a divide by zero encountered in `log10`. This warning tells us that some of our values are "bad" for taking a logarithm; in particular, some meteorites have a recorded mass of zero, and the logarithm of zero is undefined. In this case, we'll simply filter those out by adding another condition to the filter we use to select "Fell" and "Found" meteorites.
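
The behaviour behind this warning can be reproduced on a tiny array. In the sketch below, `masses` is a made-up example rather than real data; `np.errstate` is used only to keep the demonstration quiet.

```python
import numpy as np

# Hypothetical masses in grams, including one zero
masses = np.array([0.0, 1.0, 100.0])

# log10(0) is undefined; NumPy returns -inf and would normally emit
# a RuntimeWarning, which we silence here for the demonstration
with np.errstate(divide="ignore"):
    raw = np.log10(masses)

# Keeping only strictly positive masses avoids the problem entirely
safe = np.log10(masses[masses > 0])

print(raw)
print(safe)
```

A value of `-inf` cannot be placed in a histogram bin, which is why filtering on `> 0` before taking the logarithm is the safer order of operations.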

In [None]:
# Here we're simply saying that the mass of the meteorite should also be greater than zero!

mass_fell_log = np.log10(landings[(landings['fall'] == "Fell") 
                                  & (landings["mass (g)"] > 0)]['mass (g)'].dropna())

mass_found_log = np.log10(landings[(landings['fall'] == "Found") &
                                   (landings["mass (g)"] > 0)]['mass (g)'].dropna())


Wonderful! By excluding meteorites with a recorded mass of zero, we've gotten rid of the warning. Now that we've taken the logarithm, let's plot these two quantities in a histogram.

In [None]:
data_for_plot = [mass_fell_log, mass_found_log]

plt.figure(figsize = (10,6))

'''
stacked = True : Tells Python we want these plots "on top of each other". Feel free to change it to
                 False to see the difference! 
'''
plt.hist(data_for_plot, 
         bins = 20,
         stacked = True,
         label = ["Fell", "Found"])

plt.ylabel( "Counts", size = 18)
plt.xlabel("Mass of Meteorite (log$_{10}$(grams))", size = 18)
plt.legend()
plt.show()

We see that there seem to be some differences in the distributions, but because of the difference in counts, it's very difficult to compare the two directly. Not to worry, however! We simply have to convert from "counts" to "percentages" for each in order to put them on the same scale. This can be considered a form of normalization. In order to do that, we have to use a few more arguments of the `hist` function from `matplotlib`.
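
To make the idea of normalization concrete, here is a minimal sketch using `np.histogram` with `density=True`. The two groups are randomly generated stand-ins (their sizes, means, and the seed are assumptions for illustration), deliberately very different in size, like our "Fell" and "Found" meteorites.

```python
import numpy as np

# Hypothetical log10 masses for a small group and a much larger group
rng = np.random.default_rng(0)
small_group = rng.normal(loc=3.0, scale=1.0, size=50)
large_group = rng.normal(loc=2.0, scale=1.0, size=5000)

areas = {}
for name, values in [("small", small_group), ("large", large_group)]:
    # density=True rescales the counts so that the bars enclose a total area of 1
    density, edges = np.histogram(values, bins=20, density=True)
    areas[name] = np.sum(density * np.diff(edges))
    print(name, "group: area under histogram =", round(areas[name], 6))
```

Because each histogram encloses the same total area regardless of how many points went into it, the shapes of the two distributions can be compared directly. Strictly speaking, the bar heights are densities rather than percentages: a height only becomes a fraction of the data once it is multiplied by the bin width.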

In [None]:

plt.figure(figsize=(11,7))


'''
Here the new arguments to hist are as follows

density  : Setting this to True tells matplotlib to rescale each histogram so that the total area under
           its bars is 1. Instead of raw counts we get what can be considered a "probability density",
           which puts both of our meteorite fall types on the same scale
          
histtype : This is a stylization parameter. "stepfilled" is simply telling Python that we want bars that look
           like "steps" and for them to be colored in. 
           
           You can also change this to 'bar', 'barstacked' or 'step' to see how the different plot styles 
           look. We note that some styles will affect the scaling. 

alpha    : This takes values from 0 -> 1 and is a measure of how opaque the traces are
           (0 is fully transparent, 1 is fully opaque). 
'''

plt.hist(data_for_plot, 
         bins = 20, 
         density = True, 
         histtype='stepfilled', 
         alpha = 0.55,
         label = ["Fell", "Found"])  # same order as data_for_plot


# The dollar signs allow us to use math symbols in the text. 
plt.xlabel("Mass of Meteorite (log$_{10}$(grams))", size = 18)
plt.ylabel("Normalized Number Counts", size = 18)
plt.title("Mass Distribution of Meteorite Observations", size = 20)

# The prop keyword sets the font properties of the legend text (here, its size). 
plt.legend(prop={'size': 16})
plt.show()

Now that we've changed the scale, it appears that the more massive a meteorite is, the less likely it is to be found.

---

## Caution
Be aware of the scaling. While it _appears_ that more massive meteorites are much less likely to be found, keep in mind that the blue histogram represents only about 2.5% of all observed meteorites (see the earlier histogram of raw counts as a reminder).

---

### Interpretation

The fact that the more massive a meteorite is, the less likely it is to be found seems counterintuitive at first glance. However, this is actually a result of both the effect of the atmosphere on large, fast-moving bodies and the definition of "Found" used by the scientists who document these meteorite falls.

In terms of the atmosphere, larger meteorites have a tendency to explode as they enter Earth's atmosphere. For example, in 2013 a rather large meteor exploded over Chelyabinsk, Russia, and its fall can be seen in the YouTube video below.

In [None]:
# This library allows us to embed YouTube in Jupyter. 
from IPython.display import YouTubeVideo

YouTubeVideo('fBLjB5qavxY',width=1024*0.75, height=576*0.75)

However, despite all of that, if we look up this meteor in our data set, we will find something interesting. 

In [None]:
landings[landings.name == 'Chelyabinsk']

Despite the many camera angles that captured the meteor, it was never "found". Why is that? Well, that brings us to a point of semantics. This meteor was the result of an asteroid approximately 20 meters wide with a mass greater than 10000 tonnes. However, only about 1000 kg of it has been recovered to date. As a result, this meteor is classified as "Fell" instead of "Found", and its recorded mass is approximately what has been recovered.


So while there is indeed a relationship between mass and whether or not the meteor is found, this relationship is primarily due to the greater likelihood of a large meteorite exploding, and to the definition of "Found" requiring that the majority of the body be recovered.

For more information about meteorite explosions and the Chelyabinsk meteor, see:

1. [The Wikipedia article](https://en.wikipedia.org/wiki/Chelyabinsk_meteor)
1. [This Science Alert Article](https://www.sciencealert.com/why-do-meteors-explode-when-they-reach-earth-atmosphere)

Certainly, many factors other than mass affect whether a meteorite is found, such as geography or whether the fall was reported. By exploring the relationship with mass, however, we were able to discover an interesting trend hidden within the data.

# Conclusion

In this notebook we demonstrated how you might go about working with a data set to start teasing out the more interesting information hidden within it. More importantly, we went through the steps to create a histogram and covered many potential problems you may encounter in doing so. We covered some common errors and more subtle problems that arise when a data set like this one has a large spread in values, along with some potential solutions to those problems. We also saw how some interesting trends in data may have perfectly reasonable explanations that are less exciting than the data may lead us to believe. It is our hope that this tutorial series has left you feeling more confident when it comes to working with open data in Jupyter notebooks.

![alt text](https://github.com/callysto/callysto-sample-notebooks/blob/master/notebooks/images/Callysto_Notebook-Banners_Bottom_06.06.18.jpg?raw=true)