![alt text](https://github.com/callysto/callysto-sample-notebooks/blob/master/notebooks/images/Callysto_Notebook-Banner_Top_06.06.18.jpg?raw=true)

# Working With Open Data: Car Mileage Data Part 1



The availability and quality of open data (data that is freely available to the public) are increasing at an astonishing rate, so the skills needed to work with and analyze data are becoming more and more important. In this notebook series we will walk you through the basics of working with open data in a Jupyter notebook, as well as some of the first steps you can take in data exploration and visualization. These are meant to be introductory notebooks for someone who may not have seen any code or a Jupyter notebook before, so we will walk through each step in detail and keep our analysis at a fairly high level. Of course, we will also show how a high-level analysis can lead to interesting conclusions or realizations.


# Bringing Open Data Into Jupyter

The easiest way to load open data into Jupyter is using a Python package called pandas. Pandas is a free and open source library for Python that allows you to manipulate and analyze almost any data you manage to import easily and quickly. It can load CSV files, spreadsheets, text files, and even tables directly from webpages (although, sometimes those tables require a little more processing). In essence, you can think of the pandas library as "Excel on steroids".

For this first demonstration, we will walk you through the process of downloading, manipulating, and visualizing an open data set directly from an open data portal. We'll be using this website: [data.opendatasoft.com](https://data.opendatasoft.com). In particular we will be using a rather large data set of vehicle fuel economy available at [this link](https://data.opendatasoft.com/explore/dataset/us-vehicle-fuel-economy-data-1984-2017%40kapsarc/table/?disjunctive.make&disjunctive.model&sort=-year). If you travel to that page, you should see a data set that looks like the picture below.

> ![Screenshot of data.opendatasoft.com](images/car_data_screen.png)

If you click the large "Export" button on that screen, it will take you to this page:

>![Screenshot of the download screen](images/downloadscreen.png)

If you click the link beside your desired file format, you can download the file directly to your computer. However, as we're working on the hub, we're going to get this notebook to download the data directly from the site rather than downloading it to our computer and then re-uploading it to the hub. To do that, right click on the link beside the CSV file and click "Copy Link Address" to get the path to the actual data file. We will be using this link to download the car data directly into Jupyter.

Before we can download that data we have a few bookkeeping items to take care of. First, we need to import the pandas library and set our environment up to place any graphs or visualizations we make directly into the notebook. This is done in the cell below.

In [None]:
# This first line imports our pandas library, and gives it the name "pd" (so we can type less)
import pandas as pd
# we also import the plotting library
import matplotlib.pyplot as plt 
# This tells Jupyter we want to place our graphs directly into the notebook. 
%matplotlib inline


## Getting the Data Into Jupyter

Using that link we copied earlier, we can import that data directly into Jupyter. Let's first make our lives a little easier and assign our URL to a variable named `url`, which is done below. 

In [None]:
# Notice how we've placed the url between quotes!

url = 'https://data.opendatasoft.com/explore/dataset/us-vehicle-fuel-economy-data-1984-2017@kapsarc/download/?format=csv&timezone=America/Denver&use_labels_for_header=true'

Now comes the matter of downloading that data set. Luckily pandas comes with a function called `read_csv`, which conveniently enough will read a CSV file into what is called a _DataFrame_. For our purposes, a DataFrame can just be thought of as a table of data.

There is one complication. The values in a CSV file can be separated by commas (","), semi-colons (";"), or tabs ("\t"). In our case, from the image we have above, we know that our data set is separated with semi-colons. By default pandas assumes data is separated by commas, so we have to specify our alternative delimiter when downloading this data set. This is a rather large data set, and it could take a minute or two to download. You'll know the download is complete when `In [*]` in the upper left corner outside of the cell below changes to `In [3]` (or a larger number if you've run multiple cells).
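
To see why the separator matters without waiting for the full download, here is a minimal sketch using a tiny made-up table held in a string. The `sample` text and its values are invented for illustration and are not taken from the car data.

```python
import io
import pandas as pd

# A tiny, made-up semi-colon separated table (not the real car data)
sample = "make;model;year\nHonda;Civic;2015\nFord;Explorer;2017\n"

# With the default comma separator, each line is read as one single column
wrong = pd.read_csv(io.StringIO(sample))

# Telling pandas the separator is ';' splits the data into three columns
right = pd.read_csv(io.StringIO(sample), sep=';')

print(wrong.shape)   # (rows, columns) when the separator is wrong
print(right.shape)   # (rows, columns) when the separator is right
print(right)
```

If a freshly loaded table ever shows up with a single, very long column name, a wrong separator is a likely cause.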

## NOTE: We may want to host this data differently so it downloads faster in a demo, or save it locally? This takes about 30 seconds, but I don't know what would happen if a bunch of people tried at once

In [None]:
'''
Here we're actually downloading the data set which is at 'url', the download link we copied earlier.
We've also specified our data delimiter with the 'sep' argument, and we've stated that the character
is a semi-colon. Finally, we're also assigning this data to a variable we've called car_data
'''
car_data = pd.read_csv(url, sep=';')

# This writes our newly downloaded file to a local CSV so we don't have to download it again
# if we want to use this data later. The term in the quotes is the name of the file that
# we will create while saving.
# car_data.to_csv("car_data.csv")
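
The commented-out `to_csv` line above hints at a useful pattern: download once, then reuse the local copy. Below is a minimal sketch of that idea. The helper name `load_cached` and the temporary file location are hypothetical, and a small made-up table stands in for the real download so the example runs quickly.

```python
import os
import tempfile
import pandas as pd

def load_cached(path, download):
    """Read 'path' if it already exists; otherwise call 'download' and save the result."""
    if os.path.exists(path):
        return pd.read_csv(path)
    data = download()
    # index=False stops pandas from writing the row numbers as an extra column
    data.to_csv(path, index=False)
    return data

# Stand-in for the slow download (hypothetical data, not the real car table)
def fake_download():
    return pd.DataFrame({'make': ['Honda', 'Ford'], 'year': [2015, 2017]})

cache_file = os.path.join(tempfile.mkdtemp(), 'car_data.csv')
first = load_cached(cache_file, fake_download)    # "downloads" and writes the file
second = load_cached(cache_file, fake_download)   # reads the saved file instead
print(second)
```

In this notebook the same helper could be called as `load_cached('car_data.csv', lambda: pd.read_csv(url, sep=';'))`, assuming you are happy to keep a copy of the file next to the notebook.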


## Exploring the Data

Now that we have our data set downloaded, let's take a look at it.

In [None]:
car_data


If you scroll to the bottom of the table window above you'll see the number of columns and rows of our table. This particular data set has approximately 40000 rows and 83 columns. That's quite a lot of data! This table also includes "non-numeric" data such as car and model names, which can occasionally cause problems. You may also have noticed several entries in the table of `NaN`. This simply means there was no data entered at that location in the table, and it can be thought of as "an empty cell".
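
You don't have to scroll to find the table's size or its empty cells. The sketch below uses a small made-up table (the `toy` data frame is an assumption for illustration, not the car data) to show a few standard pandas tools for this.

```python
import numpy as np
import pandas as pd

# Small made-up table with one missing value (np.nan marks the empty cell)
toy = pd.DataFrame({
    'make': ['Honda', 'Ford', 'Bentley'],
    'UCity': [35.0, np.nan, 14.0],
})

rows, columns = toy.shape          # (number of rows, number of columns)
print(rows, columns)

print(toy.isna().sum())            # count of empty cells in each column
print(toy.head(2))                 # only the first two rows
```

The same three calls work on `car_data`, for example `car_data.shape`.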

We're going to show you just how easy it is to work with this amount of data using Jupyter notebooks, and how to deal with non-numeric and empty data easily and effectively. 

### Getting Column Names
For any analysis with a data table, it is important to know the column names of the data set. Unfortunately our table is too wide to see all of them in the table above. Luckily for us, we have other ways of extracting this information:

In [None]:
'''
With a dataframe, the Python function 'list' returns a list of the column headers
in any given data frame. The column headers of our car_data data frame are printed
below. 
'''

# Note: If you want this to display as a single column, remove 'print' as well as its 
# parentheses (the first and last ones) 

print(list(car_data))


A list of what many of the less obvious column headers represent is available [at this link](https://data.opendatasoft.com/explore/dataset/us-vehicle-fuel-economy-data-1984-2017%40kapsarc/information/?disjunctive.make&disjunctive.model&sort=-year). 
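
With 83 columns it can also help to ask pandas which columns hold numbers, since those are the ones that can be plotted directly. This is a minimal sketch on a made-up table; the `toy` data frame is an assumption for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'make': ['Honda', 'Ford'],
    'year': [2015, 2017],
    'UCity': [35.0, 21.5],
})

print(list(toy))                 # column names, same result as toy.columns.tolist()

# Only the columns holding numbers
numeric_columns = toy.select_dtypes(include='number').columns.tolist()
print(numeric_columns)
```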

### Understanding the Data

Before we work our way through using this data frame to create a visualization, it's useful to try to understand what data we have available. In our case, it is of interest to see how many car models and manufacturers there are in the data set. First, let's see how many models (e.g. Honda Civic, Ford Explorer, Lamborghini Huracán, etc.) are in the data set. This is done below.

In [None]:
'''
Here we're selecting only the 'model' column from within our data frame by using square 
brackets after our data frame name, and our column name in quotes. The len() function
simply returns the length of a given list or column. 
The '.unique()' function returns only the unique models in the data set 
(if you remove .unique(), you'd get the length of the entire data set!)
'''

len(car_data['model'].unique())

That's a lot of car models! At this point in any analysis, without a specific goal in mind or specific models to compare, this might be a few too many for a reasonable exploration. However, let's try the same thing to see how many car manufacturers are in the data set. 

In [None]:
len(car_data['make'].unique())


In [None]:
# This is just for a "quick view" of what the car makers are. 
print(car_data['make'].sort_values().unique())

135 car manufacturers is a lot more manageable than nearly 4000 car models for a first stab at analysis. Let's start to visualize some of this data.
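
As an aside on the counting step above, pandas also offers shortcuts for counting distinct values and for counting how many rows each value has. The sketch below uses a made-up table (the `toy` data frame is an assumption for illustration).

```python
import pandas as pd

toy = pd.DataFrame({
    'make': ['Honda', 'Honda', 'Ford', 'Bentley', 'Honda'],
    'model': ['Civic', 'Accord', 'Explorer', 'Arnage', 'Civic'],
})

# Same count as len(toy['make'].unique()) when the column has no empty cells
n_makes = toy['make'].nunique()
print(n_makes)

# Number of rows recorded for each make, largest first
rows_per_make = toy['make'].value_counts()
print(rows_per_make)
```

One difference worth knowing: by default `nunique()` does not count `NaN`, whereas `unique()` includes it in its result, so the two counts can differ by one on a column with empty cells.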

### Visualizations 

Let's focus on average MPG (miles per gallon) for each car both in the city and on the highway. These are under the `UCity` and `UHighway` columns in the data table. If we're interested in all the data, plotting it is very easy using pandas, as shown below. 

In [None]:
'''
Here we call the .plot() function (which uses our plotting library) on our data frame by typing
'car_data.plot( ... )'. The arguments inside the plot function are explained below

1. x = 'UHighway'  : This specifies which data to use for the x component of each data point. In this case 
                     we're using 'UHighway', which is the fuel economy in MPG of each vehicle while driving
                     on the highway
                   
2. y = 'UCity'     : This specifies which data to use for the y component of each data point. In this case
                     we're using "UCity", which is the fuel economy in MPG for each vehicle while driving
                     in the city
                   
3. kind = 'scatter': This specifies what kind of plot to create. In this case we're making a scatter plot, 
                     however you could also specify a line, bar, etc. plot here instead
                     
4. title = '...'   : This specifies the title to place on the top of the plot 
'''

car_data.plot(x ='UHighway', 
              y = 'UCity', 
              kind = 'scatter', 
              title="All Car MPG")
                                                                                

This plot doesn't tell us a lot, but to be fair we wouldn't expect it to! However, suppose we were concerned with only one car manufacturer instead of _all_ the vehicles in the data. For example, let's choose an everyday car that anyone has access to: a Bentley. We can view only the Bentley data as follows.

In [None]:
'''
Here we're creating a new data table consisting only of the rows which have the 
word "Bentley" in the 'make' column. This allows us to quickly filter the data down
to only the data that is relevant to cars made by Bentley. 

Essentially what this line of code does 
     
     car_data[car_data['make'] == 'Bentley']
     
is create an entirely *new* data frame of only the rows where the 'make' column is identical to "Bentley".
This will ignore any and all data that is not associated with the word "Bentley" in the "make" column

'''
# Note that there are two equals signs! 

Bentley_data = car_data[car_data['make'] == 'Bentley']

Bentley_data.plot(x ='UHighway',
                  y = 'UCity',
                  kind = 'scatter', 
                  title="Bentley MPG")
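
To see what the filtering expression is doing, it can help to split it into two steps. This is a minimal sketch on a made-up table; the `toy` data frame and its values are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'make': ['Honda', 'Bentley', 'Ford', 'Bentley'],
    'UCity': [35.0, 14.0, 21.5, 15.5],
})

# Step 1: the comparison gives one True/False answer per row
is_bentley = toy['make'] == 'Bentley'
print(is_bentley.tolist())

# Step 2: using those answers inside square brackets keeps only the True rows
bentley_rows = toy[is_bentley]
print(bentley_rows)
```

Writing `toy[toy['make'] == 'Bentley']` on one line does exactly the same thing without naming the intermediate True/False column.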

Now we're starting to get some data that we might find more interesting! However, suppose we'd like to connect these dots with lines. We would need to change the above code to something like this:

In [None]:
Bentley_data.plot(x ='UHighway',
                  y = 'UCity',
                  kind = 'line', 
                  title="Bentley MPG Line Graph")

Of course, that graph is nearly incomprehensible! That's a result of our data frame plotting points in the order it sees them, and our data is unsorted! Luckily for us, this is easily remedied by sorting our data before we plot, as shown below.

In [None]:
'''
Here we're sorting *every row* of our data frame with respect to the specified column "UHighway" 
by using the sort_values function included with pandas
'''

Sorted_Bentley = Bentley_data.sort_values("UHighway")
Sorted_Bentley.plot(x ='UHighway',
                    y = 'UCity',
                    kind = 'line', 
                    title="Bentley MPG Line Graph Sorted")
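
One detail of `sort_values` is worth seeing in isolation: it sorts whole rows, and it returns a new table rather than changing the original. The sketch below shows this on a made-up table (the `toy` data frame is an assumption for illustration).

```python
import pandas as pd

toy = pd.DataFrame({
    'UHighway': [30.0, 22.0, 26.0],
    'UCity': [21.0, 14.0, 18.0],
})

# sort_values returns a new, sorted table; 'toy' itself is left unchanged
sorted_toy = toy.sort_values('UHighway')

print(toy['UHighway'].tolist())        # original order
print(sorted_toy['UHighway'].tolist()) # sorted order
print(sorted_toy['UCity'].tolist())    # the city values moved along with their rows
```

This is why the notebook stores the result in a new variable (`Sorted_Bentley`) instead of expecting `Bentley_data` to change.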

### Formatting a Plot

You may have noticed that all the plots we've shown so far lack a lot of "pleasing" formatting, and that we haven't mentioned how to control more important details such as the axis labels. The code block below specifies some of the basic settings that we can use to make our plots look a little more presentable. 

In [None]:
'''
Here we explain what each line does that you haven't seen before using a different comment structure
to introduce you to how you may see comments in other resources/online. 
'''

Sorted_Bentley.plot(x ='UHighway',
                    y = 'UCity',
                    kind = 'line', 
                    figsize = (12,8),              # Set the figure size to be 12 x 8 (inches)
                    color = 'red'                  # Changes the color of the plot, could also use color codes
                    )
# 'plt' is taken from the plotting library we imported at the beginning of the notebook 
# and is also used by pandas. Doing it out here gives us more control 

plt.title("Bentley MPG Sorted", size = 20)         # Set title with font size
plt.xlabel("Highway Miles Per Gallon" , size = 16) # Add an x axis label, 'size' controls the font size
plt.ylabel("City Miles Per Gallon", size = 16)     # Add a y axis label
plt.xticks(size = 14)                              # Change font size of xticks
plt.yticks(size = 14)                              # Change font size of yticks
plt.grid(True)                                     # Adds a grid to the plot (use False to remove it)
plt.autoscale(tight=True)                          # This removes the padding around the plot 
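
The same formatting can also be applied through the object that `.plot()` returns, which is handy once a notebook contains more than one figure. This is an alternative sketch rather than the approach used in this notebook, and the `toy` data frame is made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

toy = pd.DataFrame({
    'UHighway': [22.0, 26.0, 30.0],
    'UCity': [14.0, 18.0, 21.0],
})

# .plot() returns the matplotlib Axes it drew on, so we can keep a handle to it
ax = toy.plot(x='UHighway', y='UCity', kind='line', color='red', figsize=(12, 8))

ax.set_title("Toy MPG", fontsize=20)
ax.set_xlabel("Highway Miles Per Gallon", fontsize=16)
ax.set_ylabel("City Miles Per Gallon", fontsize=16)
ax.tick_params(labelsize=14)      # font size of both the x and y tick labels
ax.grid(True)                     # True turns the grid on, False turns it off
```

Because every call goes through `ax`, there is no ambiguity about which plot is being formatted.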

## Before You Try It Yourself

We'll soon be at a part of this notebook where we encourage you to manipulate the plots directly, however, here are a few common error messages you may see in the process, and how to fix them. 

### Common Error Messages

### Type Error
Whenever you get an error in the cell below, it is usually the result of a simple typo! When this happens you'll typically get a large and intimidating error message that makes it seem a lot worse than it is. For example, if you ran a code cell with this:
```python
car_data[car_data['make'] == 'Fake Car Maker'].sort_values('UHighway').plot(x = 'UHighway',
                                                                            y = 'UCity',
                                                                            kind = 'line', 
                                                                            title=" MPG Line Graph Sorted")
```

You would get the following error message which would appear to be quite intimidating!

![error messages](images/error.png)

All that output is the result of making one small typo! But don't let that worry you; the important information is on the very first line of the error message, where it says `TypeError` and shows you the approximate location where the error was encountered. The other piece of important information is at the bottom of the output, where you see the actual error message:

```python
TypeError: Empty 'DataFrame': no numeric data to plot
```

What this is telling you is that you're trying to plot an empty dataframe. At first glance that seems impossible: we know that we have data in `car_data`! Where did it go?

The culprit is the step where we filter down to only the rows relevant to our car manufacturer. As there is no manufacturer "Fake Car Maker" in our data set, the filter returns no rows at all. To fix this, all we need to do is change the text 'Fake Car Maker' to a car maker which is present in the data set. 
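
One way to catch this situation before plotting is to check whether the filter returned any rows. The sketch below is one possible approach, not part of the original notebook: the helper `rows_for_make` is hypothetical and the `toy` data frame is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'make': ['Honda', 'Bentley', 'Ford'],
    'UCity': [35.0, 14.0, 21.5],
})

def rows_for_make(data, make):
    """Return the rows for 'make', or raise a readable error if there are none."""
    subset = data[data['make'] == make]
    if subset.empty:
        available = sorted(data['make'].unique())
        raise ValueError(f"No rows for make {make!r}. Available makes: {available}")
    return subset

print(rows_for_make(toy, 'Bentley'))

try:
    rows_for_make(toy, 'Fake Car Maker')
except ValueError as error:
    print(error)
```

The `.empty` attribute is `True` when a data frame has no rows, which is exactly the case that triggers the `TypeError` above.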

### Name Error

If you were to run a line of code like this 
```python
car_data[car_data['make'] == Bentley].sort_values('UHighway').plot(x='UHighway',
                                                                   y = 'UCity',
                                                                   kind = 'line', 
                                                                   title=" MPG Line Graph Sorted")
```
you would get the following error message

```python
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-181-0856eb5a442a> in <module>()
----> 1 car_data[car_data['make'] == Bentley].sort_values('UHighway').plot(x='UHighway',
      2                                                                    y = 'UCity',
      3                                                                    kind = 'line',
      4                                                                    title=" MPG Line Graph Sorted")

NameError: name 'Bentley' is not defined
```

This happens because we have not wrapped our car maker in quotes, so Python is looking for a variable named Bentley. To fix this, simply change Bentley $\rightarrow$ 'Bentley'.

### Key Error

Should you accidentally try to plot a column that doesn't exist due to a typo, as in a snippet like this

```python
car_data[car_data['make'] == 'Bentley'].sort_values('UHighway').plot(x='UHighway',
                                                                     y = 'THIS IS NOT A COLUMN IN THE DATA SET',
                                                                     kind = 'line', 
                                                                     title=" MPG Line Graph Sorted")
```

You would get a very long and intimidating error message, the bottom of which would be this:
```python
KeyError: 'THIS IS NOT A COLUMN IN THE DATA SET'
```

What this is telling you is that there is no column in your table with the label `'THIS IS NOT A COLUMN IN THE DATA SET'`. To fix this, simply choose a column that exists in your table, or fix any minor typos.
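
If you are unsure whether a column name is spelled correctly, you can check it before plotting. The sketch below is one possible approach: the helper `suggest_column` is hypothetical, the `toy` data frame is made up, and `difflib` is a module from Python's standard library that finds similar-looking text.

```python
import difflib
import pandas as pd

toy = pd.DataFrame({
    'make': ['Honda', 'Bentley'],
    'UCity': [35.0, 14.0],
    'UHighway': [45.0, 22.0],
})

def suggest_column(data, name):
    """Return 'name' if it is a column, otherwise a list of similar column names."""
    if name in data.columns:
        return name
    return difflib.get_close_matches(name, list(data.columns))

print(suggest_column(toy, 'UCity'))      # an exact match is returned unchanged
print(suggest_column(toy, 'UCty'))       # a typo returns a list of likely candidates
```

An empty list means nothing similar was found, in which case printing `list(car_data)` again is the quickest way to find the right name.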

Those are the only error messages you're likely to see in the snippet below (or really any pandas plot like this!), and now that you know what they mean and how to fix them, you should be all set to try out the code below with any columns/data in our table. 

## Try It Out

In the cell below we've set up the Bentley test case again. However, feel free to change which manufacturer you're exploring. You should also feel free to explore other data for the $x$ and $y$ axes. Don't worry if something fails! Hopefully the common error message solutions above will help you out.

In [None]:
car_data[car_data['make'] == 'Bentley'].sort_values('UHighway').plot(x='UHighway',
                                                                     y = 'UCity',
                                                                     kind = 'line', 
                                                                     title=" MPG Line Graph Sorted",
                                                                     figsize = (12,8))

plt.title("Bentley MPG Sorted", size = 20)         # Set title here with font size
plt.xlabel("Highway Miles Per Gallon" , size = 16) # Add an x axis label, 'size' controls the font size
plt.ylabel("City Miles Per Gallon", size = 16)     # Add a y axis label
plt.xticks(size = 14)                              # Change font size of xticks
plt.yticks(size = 14)                              # Change font size of yticks
plt.grid(True)                                     # Adds a grid to the plot (use False to remove it)
plt.autoscale(tight=True)                          # This removes the padding around the plot 

If you don't feel like scrolling all the way up to find other car make names, simply uncomment the cell below to see the list of car makers. See instructions in the cell to understand what we mean by "Uncomment" 

In [None]:
'''
To 'uncomment' the line of code below, you simply need to delete the pound sign (hash tag for those of 
you who frequent Twitter) and the trailing space afterwards. In Python any characters in a line after a
pound sign are ignored by the program and called a comment, as you have likely seen in many of the code blocks
above. They can be very handy. You'll also notice that this block of text is contained between three
quotes (') on each side. This is often used as a "block comment": Python treats it as a piece of text and
does nothing with it, which makes it more useful for typing longer messages such as this. 
'''

# print(car_data['make'].sort_values().unique())
None

# Conclusion

In this notebook we demonstrated how you can download an open data set directly from the Internet into a Jupyter notebook and start to explore the data. We went through how to do some basic filtering of our data set using `pandas`, as well as how to create plots easily and effectively from the data without performing any complex manipulations on the data itself. While we did not reach any interesting conclusions in this analysis, it represents a first step in data exploration. In part two of this notebook, we will go over the basics of data aggregation in order to tease more exciting and interesting trends out of this data set.

![alt text](https://github.com/callysto/callysto-sample-notebooks/blob/master/notebooks/images/Callysto_Notebook-Banners_Bottom_06.06.18.jpg?raw=true)