<h1 style="text-align: center; color: green; font-family: Arial"> MATH 210 Project I</h1>

## Introduction to Pandas 




**Pandas** provides fast, flexible, and expressive data structures in Python. Its stated goal is to become the most powerful and flexible open source data analysis and manipulation tool available in any language.

This tutorial focuses on **how to import and edit tables** in Python. For this reason, I will explore the following two functions:
    
* `pd.read_table` (see the [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html))
* `pd.DataFrame` (see the [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html))

### Content 

    1. Loading data
    2. Pandas datastructures
    3. Cleaning and formatting data
    4. Basic visualization
    5. References

### How do we import the package?

In [None]:
%matplotlib inline
# Tell Jupyter to display every plot inline in the notebook.

import pandas as pd
# Import the pandas package under its conventional alias pd.

import numpy as np
import matplotlib.pyplot as plt

pd.set_option("display.max_rows", 10)
# Set the maximum number of rows displayed for each table.

LARGE_FIGSIZE = (12, 8)
# Define a constant size for the displayed figures.

In [None]:
# Change this path to the location of the demo files in your Jupyter environment
%cd ~/Math210/Project1
%ls

### 1. Loading data

#### Import a local text file:

The function [`pd.read_table`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html) has many possible parameters; in the ideal case, however, we do not need to change their default values. The file location is passed as a string (in quotation marks), and the file should be in the current working directory of the Jupyter session or be given with its path. This code is an example of how we could import a local text file.

In [None]:
filename = "temperature_lattitudes.dat"
full_globe_temp = pd.read_table(filename)
full_globe_temp

As we can see, our first try did not give us the table that we wanted. We can use the parameter `sep=r"\s+"` (a regular expression that matches one or more whitespace characters) to separate the information into two columns. Moreover, we can set the title of each column using the parameter `names=["title1", "title2", ...]`.

In [None]:
full_globe_temp = pd.read_table(filename, sep=r"\s+", names=["year", "mean temp"])
full_globe_temp
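
To see what `sep` and `names` change without needing the data file, here is a minimal sketch that parses a string held in memory. The sample values and the names `sample`, `raw` and `parsed` are made up for illustration; they are not taken from `temperature_lattitudes.dat`.

```python
import io

import pandas as pd

# Hypothetical whitespace-separated sample, not the real data file.
sample = "1880  -0.1591\n1881  -0.0789\n1882  -0.1313\n"

# The default separator is a tab, so each line lands in a single column
# and the first line is taken as the header.
raw = pd.read_table(io.StringIO(sample))

# Split on runs of whitespace and supply our own column names.
parsed = pd.read_table(io.StringIO(sample), sep=r"\s+", names=["year", "mean temp"])

print(raw.shape)
print(parsed.shape)
```

Comparing the two shapes shows that only the second call produces one column per field.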

#### From a chunked file

When we want to get a table from a file that is split into multiple chunks, we can use the following code:

In [None]:
giss_temp = pd.read_table("Global-Land-Ocean-Temperature-Index-1951-1980.txt", sep=r"\s+", skiprows=7,
                          skipfooter=11, engine="python")
giss_temp

We can notice that the table should be indexed by the year. We can change that with the method `set_index("Name_of_index_column")`:

In [None]:
giss_temp = giss_temp.set_index("Year")
giss_temp.head()

The parameters `skiprows=7` and `skipfooter=11` (spelled `skip_footer` in older pandas releases) indicate the lines that we want to avoid: the first 7 lines and the last 11 lines of the file, since those parts of the original file do not contain table data. Skipping a footer requires `engine="python"`.
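
The effect of these two parameters can be sketched on a small string held in memory. The layout below (two lines of notes, a table, one trailing note) is a hypothetical stand-in and does not reproduce the real GISS file.

```python
import io

import pandas as pd

# Hypothetical file layout: 2 lines of notes, a table, 1 trailing note.
text = (
    "Some description\n"
    "More description\n"
    "Year Jan Feb\n"
    "1880 -30 -21\n"
    "1881 -10 -14\n"
    "End of file note\n"
)

# skiprows drops lines from the top, skipfooter drops lines from the bottom.
# skipfooter is only supported by the Python parsing engine.
table = pd.read_table(
    io.StringIO(text), sep=r"\s+", skiprows=2, skipfooter=1, engine="python"
)

# Use the Year column as the row index.
table = table.set_index("Year")
print(table)
```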

#### From a remote file

Often we would like to work with online data, which can be imported easily with the following code:

In [None]:
northern_sea_level = pd.read_table("http://sealevel.colorado.edu/files/current/sl_nh.txt",
                                   sep=r"\s+")
northern_sea_level

I used the parameter `sep=r"\s+"` once again because, without it, all the data was stored in a single column. As you can see, we just need a URL in order to import data from the internet.

### 2. Pandas datastructures

In this section we will explore the `pd.DataFrame` constructor. I will start by loading one new table from the web in the same way I did before.

In [None]:
southern_sea_level = pd.read_table("http://sealevel.colorado.edu/files/current/sl_sh.txt",
                                   sep=r"\s+")
southern_sea_level

I will use [`pd.DataFrame`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) to create a new table that combines the `msl_ib(mm)` columns of the northern and southern sea level tables. This table will be indexed by the year of the southern sea level table.

In [None]:
mean_sea_level = pd.DataFrame({"northern_hem": northern_sea_level["msl_ib(mm)"].values,
                               # Each key is a column name; each value is the array of data for that column.
                               "southern_hem": southern_sea_level["msl_ib(mm)"].values},
                              index=southern_sea_level.year)
                              # Index the rows by the year column of the southern sea level table.
mean_sea_level
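
The same construction can be tried without an internet connection. This is a minimal sketch with made-up numbers; the arrays `years`, `north` and `south` are hypothetical and only imitate the shape of the sea level data.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; the real tables come from the URLs above.
years = np.array([1993.01, 1993.04, 1993.07])
north = np.array([-10.2, -8.5, -7.9])
south = np.array([-4.1, -3.3, -2.8])

# Each dictionary key becomes a column name.
# All arrays, and the index, must have the same length.
levels = pd.DataFrame({"northern_hem": north, "southern_hem": south}, index=years)
print(levels)
```

Note that the arrays passed in the dictionary need matching lengths, so the two source tables should cover the same number of rows.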

### 3. Cleaning and formatting data

In the previous data structure, the index is labeled `year`, but its values look more like dates. We can make this explicit by renaming the index to `date`.

In [None]:
mean_sea_level.index.name = "date"
mean_sea_level

In the full globe table, the value -999.00 marks a year for which no data was collected. We can make pandas treat these entries as missing by setting them to `np.nan`, which stands for "not a number".

In [None]:
full_globe_temp[full_globe_temp == -999.000] = np.nan
full_globe_temp.tail(n=3) 
# Shows the last 3 rows of data
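
As a minimal sketch of two ways to handle such a sentinel value, the example below uses a made-up sample (the numbers are hypothetical). The second option relies on the `na_values` parameter of `pd.read_table`, which marks the listed values as missing while the file is being parsed.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample in which -999.000 marks a missing measurement.
sample = "2009 0.56\n2010 0.66\n2011 -999.000\n"

# Option 1: replace the sentinel after loading, as in the cell above.
temps = pd.read_table(io.StringIO(sample), sep=r"\s+", names=["year", "mean temp"])
temps[temps == -999.0] = np.nan

# Option 2: let the parser do it with the na_values parameter.
temps_na = pd.read_table(
    io.StringIO(sample),
    sep=r"\s+",
    names=["year", "mean temp"],
    na_values=["-999.000"],
)

print(temps.tail(n=3))
print(temps_na.tail(n=3))
```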

We can use the `.drop` method to eliminate a row or column that is not interesting to us:

In [None]:
giss_temp = giss_temp.drop("Year.1", axis=1)
# First give the label of the row or column to drop, then the axis to look it up on.
# axis=1 refers to columns and axis=0 (the default) refers to rows.
giss_temp
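
The difference between the two axes can be sketched on a small made-up table. The DataFrame `df` below is hypothetical and only loosely mimics the duplicated `Year.1` column.

```python
import pandas as pd

# Hypothetical table with a duplicated year column.
df = pd.DataFrame(
    {"Jan": [1, 2], "Feb": [3, 4], "Year.1": [1880, 1881]},
    index=[1880, 1881],
)

no_col = df.drop("Year.1", axis=1)  # axis=1: look up the label among the columns
no_row = df.drop(1880, axis=0)      # axis=0: look up the label in the row index

print(no_col.columns.tolist())
print(no_row.index.tolist())
```

In both cases `drop` returns a new table and leaves `df` unchanged, which is why the tutorial reassigns the result to `giss_temp`.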

We can also keep only the columns that we want:

In [None]:
giss_temp = giss_temp[["Jan", "Feb", "Mar", "Apr", "May", "Jun"]]
# Select several columns by passing a list of column names.
giss_temp

Now we can remove the last row, which just repeats the column titles, by calling the method `.drop` with that row's index label.

In [None]:
giss_temp = giss_temp.drop("Year")
giss_temp

As we can see, the last row has some unknown values, written as `****`. To tell pandas that those symbols are missing values rather than numbers, we can use `np.nan` again.

In [None]:
giss_temp = giss_temp.where(giss_temp != "****", np.nan)
giss_temp
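
Here is a minimal sketch of how `where` behaves, using a made-up two-row table (the values are hypothetical). The method keeps every entry for which the condition is true and substitutes the second argument everywhere else.

```python
import numpy as np
import pandas as pd

# Hypothetical rows read as text, with "****" marking missing months.
df = pd.DataFrame({"Jan": ["-30", "85"], "Feb": ["-21", "****"]}, index=[1880, 2015])

# Keep a value when the condition is True, use np.nan otherwise.
cleaned = df.where(df != "****", np.nan)
print(cleaned)
```

The remaining entries are still strings at this point; they only become numbers once we convert the columns, as in the visualization section.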

Now if we want to delete these missing values we can use the method `dropna`.

In [None]:
giss_temp.dropna(how="any").tail() # This basically removes all the rows that contain NaN values.

With `how="any"`, a row is dropped as soon as it contains a single `NaN`. If we only want to drop the rows in which every value is `NaN`, we can replace "any" with "all". If instead we want to replace the `NaN` values with other values, we can use the method `fillna`.

In [None]:
giss_temp.fillna(value=0).tail()
# Replace the NaN values with 0.

In [None]:
giss_temp.ffill().tail()
# Forward fill: replace each NaN with the last valid value above it in the same column.
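
The options for missing values can be compared side by side on a small made-up table. The DataFrame `df` is hypothetical; `ffill` is the pandas method that fills each `NaN` with the last valid value above it.

```python
import numpy as np
import pandas as pd

# Hypothetical table: 2014 is partly missing, 2015 is entirely missing.
df = pd.DataFrame(
    {"Jan": [0.1, np.nan, np.nan], "Feb": [0.2, 0.3, np.nan]},
    index=[2013, 2014, 2015],
)

any_dropped = df.dropna(how="any")  # drops rows with at least one NaN
all_dropped = df.dropna(how="all")  # drops rows where every value is NaN
zero_filled = df.fillna(value=0)    # replaces every NaN with 0
forward_filled = df.ffill()         # copies the last valid value downwards

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
print(forward_filled)
```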

### 4. Basic visualization

To plot our tables, the values must be numeric, so we convert them to `np.float32`. We can do this with a for loop that goes over each column.

In [None]:
for col in giss_temp.columns:
    # Loop over the columns of the table.
    giss_temp[col] = giss_temp[col].astype(np.float32)
    # Replace each column with a float32 copy of itself.

We should also convert the index values to `int32` before we use the method `plot`.

In [None]:
giss_temp.index = giss_temp.index.astype(np.int32)
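
To check that such conversions worked, we can look at the `dtypes` attribute. This is a minimal sketch on a made-up table whose values and index were read as strings; the data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical table whose values and index are strings.
df = pd.DataFrame(
    {"Jan": ["-30", "-10"], "Feb": ["-21", "-14"]},
    index=["1880", "1881"],
)

# Convert each column to float32 by assigning the converted column back.
for col in df.columns:
    df[col] = df[col].astype(np.float32)

# Convert the index labels from strings to integers.
df.index = df.index.astype(np.int32)

print(df.dtypes)
print(df.index.tolist())
```

Assigning with `df[col] = ...` replaces the whole column, so the new dtype is kept.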

Now we can proceed to create our plot.

In [None]:
giss_temp.plot(figsize=LARGE_FIGSIZE)
# figsize sets the size of the figure.

We can also directly create a box plot for this table of temperature in each month since 1880:

In [None]:
giss_temp.boxplot();

### 5. References

All the data files used in this tutorial were taken from the Pandas tutorials given at the SciPy 2015 and SciPy 2016 conferences. These data files are available from this [github account](https://github.com/jonathanrocher/pandas_tutorial). The online data about sea levels was taken from the University of Colorado. This tutorial is based on some of the examples used in the conference video below. If you want to learn more about Pandas or find more details about the examples used in this tutorial, please watch the whole video.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo("6ohWS7J1hVA")