# MATH 210 Project 1

## Pandas 

### Introduction



**pandas** (the name is derived from "panel data") is a powerful Python package for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables (see the [documentation](http://pandas.pydata.org/) for more).

**Our goal** in this notebook is to explore **two** important data structures in pandas: **Series** and **DataFrame**. They provide a firm basis for many applications.

By the end of the notebook, the reader will be able to construct their own data structures with Series and DataFrame.

* `Series` [documentation](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#series)
* `DataFrame` [documentation](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe)


## Contents

1. Series
2. DataFrame
3. Exercises

In [None]:
import numpy as np
import pandas as pd

### 1. Series

`Series` is a one-dimensional labeled array: it holds an array of data (of any NumPy data type) together with an array of labels, called its index.

The simplest Series is built from the data alone:

In [None]:
array1 = pd.Series([3,10,5,2])
array1

In the output above, the *data* is shown on the right-hand side and the *index* on the left-hand side. Notice that the default index starts at 0.



#### From ndarray

If our data is an ndarray (or a list), the **index** must have the same length as the **data**.

We can call

`a = pd.Series(data, index)`

which is the basic way to create a Series with a custom index.

For example:

In [None]:
array2 = pd.Series([3,10,5,2],['a','b','c','d'])
array2
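The length requirement can be checked directly. The sketch below builds the same Series from an actual NumPy array and then tries an index that is one label short; the names `values`, `labels`, and `s` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

values = np.array([3, 10, 5, 2])   # data as a NumPy ndarray
labels = ['a', 'b', 'c', 'd']      # one label per value

s = pd.Series(values, index=labels)

# An index of the wrong length is rejected with a ValueError.
try:
    pd.Series(values, index=['a', 'b', 'c'])
    length_error = None
except ValueError as err:
    length_error = err

print(s)
print("mismatch raised:", type(length_error).__name__)
```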

We can get the whole array of *data* or the whole *index* with:
* `a.values` (returns the data)
* `a.index` (returns the index)

In [None]:
array2.values

In [None]:
array2.index

We can also select specific values from the Series by their index labels.

For example:

In [None]:
array2['a']

In [None]:
array2[['a','d']]

Operations on Series work element-wise, much like operations on NumPy arrays, and a Series can be passed to most NumPy functions that expect an ndarray.

In [None]:
array2 + array2
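As a rough illustration of this NumPy-like behavior, the sketch below applies a few common operations to a Series shaped like `array2`. It also shows one difference from plain ndarrays: when two Series are combined, values are matched by label rather than by position. The names `s` and `t` are made up for this example.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 10, 5, 2], index=['a', 'b', 'c', 'd'])

doubled = s * 2        # scalar arithmetic is element-wise
roots = np.sqrt(s)     # NumPy functions accept a Series and keep its index
large = s[s > 4]       # boolean masks filter just as with ndarrays

# Two Series are matched by label, not by position.
t = pd.Series([1, 1], index=['d', 'a'])
total = s + t          # 'b' and 'c' have no partner in t, so they become NaN

print(total)
```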

#### From dictionary

If our data is a dictionary, its keys become the index and its values become the data:

In [None]:
data1 = {'a':3,'b':10,'c':5,'d':2}
array3 = pd.Series(data1)
array3

In [None]:
array4 = pd.Series(data1,index = ['a','e','c'])
array4

*Note*: `NaN` is the standard missing data marker used in pandas. In `array4`, the label `'e'` is not a key of `data1`, so its value is `NaN`.

We can check which data is missing with `isnull` or `notnull`:

In [None]:
array4.isnull()
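The boolean result of `isnull` or `notnull` can be used to filter or count entries. The following is a minimal sketch using the same data as `array4`; the names `s`, `present`, and `n_missing` are illustrative.

```python
import pandas as pd

data = {'a': 3, 'b': 10, 'c': 5, 'd': 2}
s = pd.Series(data, index=['a', 'e', 'c'])   # 'e' is not a key, so it is NaN

present = s[s.notnull()]       # keep only the entries that have a value
n_missing = s.isnull().sum()   # True counts as 1, so this counts missing values

print(present)
print(n_missing)
```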

We can also give a name to the Series and to its index. The index labels can be replaced at any time by assignment.

In [None]:
array3.name = 'example'
array3.index.name = 'index_name'
array3

In [None]:
array3.index = ['one','two','three','four']
array3

### 2. DataFrame

**DataFrame** is a 2D labeled data structure whose columns may hold different data types, much like a spreadsheet. A DataFrame can be thought of as a collection of Series that share the same index.



#### From ndarrays/lists

There are many ways to construct a DataFrame, but the most common one is to pass a dict of lists or NumPy arrays that all have the same length.

For example:

In [None]:
dt = {'country': ['China', 'USA', 'Canada', 'Russia'],
      'population': [1390, 318, 35, 143],
      'GDP': [18.6, 11.4, 1.5, 1.3]}
frame = pd.DataFrame(dt)
frame

#### From a list of dicts

In [None]:
data = [{'a':1,'b':2},{'a':5,'b':10}]
pd.DataFrame(data)

#### From a Series

The result will be a DataFrame with the same index as the input Series and a single column named after the Series. If the Series has no name, as with `array2`, the column is labeled `0`.

In [None]:
pd.DataFrame(array2)
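To see how the column label follows the Series name, here is a small sketch that converts the same Series twice, once unnamed and once named. The name `'score'` is a hypothetical choice made only for this illustration.

```python
import pandas as pd

s = pd.Series([3, 10, 5, 2], index=['a', 'b', 'c', 'd'])

unnamed = pd.DataFrame(s)   # no name, so the column is labeled 0
s.name = 'score'            # hypothetical name, for illustration only
named = pd.DataFrame(s)     # the column is now labeled 'score'

print(unnamed.columns.tolist(), named.columns.tolist())
```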

When a DataFrame is built from lists or dicts, as in the earlier examples, the index is generated automatically (0, 1, 2, ...), but we can also assign the index ourselves.

Back to our first example (country, GDP and population):

In [None]:
frame1 = pd.DataFrame(dt,index=['one','two','three','four'])
frame1

We can rearrange the order of the columns with the `columns` argument. If we request a column that is not in the data, it is filled with `NaN`:

In [None]:
frame2 = pd.DataFrame(dt, columns=['country','population','GDP','area'])

In [None]:
frame2

We can fill the empty column by assigning values to it:

In [None]:
frame2['area']=[9.6,9.8,10,17]
frame2

Just like with a Series, we can select a column of a DataFrame by its label using the same bracket syntax; rows are selected by label with `.loc`. In general, we can treat a DataFrame like a dict of indexed Series objects, and operations work with the same syntax.

In [None]:
frame2['country'] # Select a specific column

In [None]:
frame2.index=['one','two','three','four'] # Assign index
frame2
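Column and row selection can be compared side by side. The sketch below rebuilds a smaller version of the country table so that it runs on its own; `df`, `col`, `row`, and `cell` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'country': ['China', 'USA', 'Canada', 'Russia'],
                   'population': [1390, 318, 35, 143]},
                  index=['one', 'two', 'three', 'four'])

col = df['population']              # a column comes back as a Series
row = df.loc['two']                 # a row selected by label is also a Series
cell = df.loc['three', 'country']   # row label plus column label gives one value

print(row)
print(cell)
```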

In [None]:
frame2['person/area'] = frame2['population'] / frame2['area'] # New column computed from two existing columns
frame2['space'] = frame2['area'] > 10 # Boolean column: True where area exceeds 10
frame2

In [None]:
del frame2['GDP'] # delete a column
frame2

**Note**: when a list or array is assigned as a new column, it must have the same length as the DataFrame. A Series behaves differently: it is aligned on the index, and any labels it does not contain are filled with `NaN`.
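The difference between assigning a list and assigning a Series can be sketched as follows. The table is rebuilt here so the example is self-contained, and `df` and `list_error` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'country': ['China', 'USA', 'Canada', 'Russia']},
                  index=['one', 'two', 'three', 'four'])

# A list with the wrong number of entries is rejected with a ValueError.
try:
    df['area'] = [9.6, 9.8, 10]
    list_error = None
except ValueError as err:
    list_error = err

# A Series is aligned on the index; rows it does not cover are set to NaN.
df['area'] = pd.Series([9.6, 17], index=['one', 'four'])

print(df)
```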

### 3. Exercises

1. Construct a data structure that shows some of the courses you are taking and their class sizes using `DataFrame` and `Series`.

2. Construct a `DataFrame` that computes the GDP per capita for the countries in our example.