## Pandas Basics Tutorial 

Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

As we saw, NumPy's ndarray data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks. While it serves this purpose very well, its limitations become clear when we need more flexibility (e.g., attaching labels to data, working with missing data, etc.) and when attempting operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.), each of which is an important piece of analyzing the less structured data available in many forms in the world around us. Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to these sorts of "data munging" tasks that occupy much of a data scientist's time.


## Topics: 

- Introduction to Pandas Objects
- Data Indexing and Selection
- Operations in Pandas


### References

Code mostly taken from:

- Python Data Science Handbook by Jake VanderPlas
- https://www.coursera.org/specializations/statistics-with-python

In [None]:
import pandas as pd 

## Pandas Data Structures

There are three fundamental Pandas data structures: the Series, the DataFrame, and the Index.

### Series

A Series is constructed with:

pd.Series(data, index=index)

where index is optional and data can be one of several types:

- data can be a list or NumPy array, in which case index defaults to an integer sequence:

e.g. pd.Series([2, 4, 6])

- data can be a scalar, which is repeated to fill the specified index:

e.g. pd.Series(5, index=[100, 200, 300])

- data can be a dictionary, in which case index defaults to the dictionary keys (in insertion order in current pandas versions; older versions sorted them):

e.g. pd.Series({2:'a', 1:'b', 3:'c'})
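The three cases above can be tried side by side. This is a minimal sketch; the variable names are illustrative, and the dictionary case assumes a recent pandas version, where key order is preserved rather than sorted.

```python
import numpy as np
import pandas as pd

# From a list or NumPy array: the index defaults to 0, 1, 2, ...
from_list = pd.Series([2, 4, 6])
from_array = pd.Series(np.array([2, 4, 6]))

# From a scalar: the value is repeated once per index label
from_scalar = pd.Series(5, index=[100, 200, 300])

# From a dictionary: the keys become the index
from_dict = pd.Series({2: 'a', 1: 'b', 3: 'c'})

# An explicit index selects (and orders) the dictionary keys that are kept
subset = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])

print(from_list.index.tolist())
print(from_scalar.tolist())
print(from_dict.index.tolist())
print(subset.tolist())
```

Passing an explicit index together with a dictionary is a convenient way to keep only some keys in a chosen order.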


### Series as generalized NumPy array

A Series is much like a one-dimensional NumPy array. The essential difference is the index: a NumPy array has an implicitly defined integer index, whereas a Pandas Series has an explicitly defined index associated with its values.

This explicit index definition gives the Series object additional capabilities. The index need not be an integer, but can consist of values of any desired type. For example, if we wish, we can use strings as an index:




In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
               

data['b']


0.5

In [None]:
# We can even use non-contiguous or non-sequential indices:

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data[5]


0.5

### Series as specialized dictionary

In this way, you can think of a Pandas Series a bit like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.

In [None]:
#The Series-as-dictionary analogy can be made even more clear by constructing a Series object directly from a Python dictionary:

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
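To make the dictionary analogy concrete, here is a small sketch that rebuilds the same population Series and uses it both like a dictionary and like an array. The variable names other than population are illustrative.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Dictionary-style access by key
california = population['California']

# Dictionary-style membership test looks at the index labels
has_texas = 'Texas' in population

# Unlike a dictionary, a Series also supports array-style slicing by label;
# with explicit labels, both endpoints are included
first_three = population['California':'New York']

print(california, has_texas)
print(first_three)
```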



### DataFrame

The next fundamental structure in Pandas is the DataFrame. Like the Series object discussed in the previous section, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. We'll now take a look at each of these perspectives.





If a Series is an analog of a one-dimensional array with flexible indices, 
a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. 
Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, 
you can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.



In [None]:
#To demonstrate this, let's first construct a new Series listing the area of each of the five states discussed in the previous section:

area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [None]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


Like the Series object, the DataFrame has an index attribute that gives access to the index labels:



In [None]:
states.index


Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [None]:
states.columns

Index(['population', 'area'], dtype='object')

### DataFrame as specialized dictionary

Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data. For example, asking for the 'area' key returns the Series object containing the areas we saw earlier:

In [None]:
states['area']


California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
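The dictionary view extends beyond lookups. The sketch below uses a two-state version of the states frame; the density column is an illustrative addition and not part of the data above.

```python
import pandas as pd

# Dict of dicts: outer keys become columns, inner keys become row labels
states = pd.DataFrame({
    'population': {'California': 38332521, 'Texas': 26448193},
    'area': {'California': 423967, 'Texas': 695662},
})

# Iterating over a DataFrame yields its column names, like dictionary keys
column_names = list(states)

# Membership tests also look at column names, not at row labels
has_area = 'area' in states
has_texas = 'Texas' in states

# Dictionary-style assignment adds a new column (an illustrative derived value)
states['density'] = states['population'] / states['area']

print(column_names, has_area, has_texas)
print(states)
```

Note that 'Texas' in states is False: row labels are reached through states.index or through loc, not through dictionary-style access.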

## Creating DataFrame Objects

There are a number of ways to construct a DataFrame:
- From a single Series object
- From a list of dicts
- From a dictionary of Series objects
- From a two-dimensional NumPy array
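Each of the four routes listed above can be sketched in a line or two. The data and the column names (a, b, c, foo, bar) are made up for illustration, and to_frame is used here as one of several ways to turn a single Series into a one-column frame.

```python
import numpy as np
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'California': 423967, 'Texas': 695662})

# 1. From a single Series: one column with an explicit name
from_series = population.to_frame(name='population')

# 2. From a list of dicts: each dict is one row; missing keys become NaN
from_records = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# 3. From a dictionary of Series: each Series is one column, aligned on the index
from_series_dict = pd.DataFrame({'population': population, 'area': area})

# 4. From a two-dimensional NumPy array, with explicit row and column labels
from_array = pd.DataFrame(np.arange(6).reshape(3, 2),
                          columns=['foo', 'bar'],
                          index=['a', 'b', 'c'])

print(from_series)
print(from_records)
print(from_series_dict)
print(from_array)
```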


### Importing data

Pandas has a variety of functions, named read_xxx (for example read_csv, read_excel, and read_json), for reading data in various formats.
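Because the CSV file used in this tutorial may not be at hand, the sketch below reads a few rows from an in-memory string instead. The string csv_text and the name sample are stand-ins chosen for illustration; read_csv accepts a file-like object in the same way as a path.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk
csv_text = """ID,Age,Gender,Height
1,56,F,62.0
2,26,F,62.0
5,27,M,73.0
"""

# read_csv accepts a path or any file-like object; ',' is the default separator
sample = pd.read_csv(io.StringIO(csv_text))

# Column types are inferred from the text
inferred = sample.dtypes.astype(str).to_dict()

print(sample)
print(inferred)
```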


In [None]:
df = pd.read_csv('data/Cartwheeldata.csv', delimiter=',')
type(df)

pandas.core.frame.DataFrame

In [None]:
df.dtypes

ID                 int64
Age                int64
Gender            object
GenderGroup        int64
Glasses           object
GlassesGroup       int64
Height           float64
Wingspan         float64
CWDistance         int64
Complete          object
CompleteGroup      int64
Score              int64
dtype: object

### Viewing Data

In [None]:
df.head()

,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10
4,5,27,M,2,N,0,73.0,75.0,72,N,0,4


In [None]:
df #print entire dataframe

,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10
4,5,27,M,2,N,0,73.0,75.0,72,N,0,4
5,6,24,M,2,N,0,75.0,71.0,81,N,0,3
6,7,28,M,2,N,0,75.0,76.0,107,Y,1,10
7,8,22,F,1,N,0,65.0,62.0,98,Y,1,9
8,9,29,M,2,Y,1,74.0,73.0,106,N,0,5
9,10,33,F,1,Y,1,63.0,60.0,65,Y,1,8


In [None]:
df.columns

Index(['ID', 'Age', 'Gender', 'GenderGroup', 'Glasses', 'GlassesGroup',
       'Height', 'Wingspan', 'CWDistance', 'Complete', 'CompleteGroup',
       'Score'],
      dtype='object')

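## Indexers: loc and iloc

Accessing elements: explicit vs. implicit indexing. (The older ix indexer was deprecated and has been removed from current pandas versions.)

### .loc

First, the loc attribute allows indexing and slicing that always references the explicit index, that is, the row and column labels.

.loc[] takes two selectors (a single label, a list of labels, or a slice) separated by ','. The first selects rows and the second selects columns. Unlike ordinary Python slices, label slices include both endpoints.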
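The selector forms can be seen on a tiny frame. This is a sketch with a hypothetical frame named toy, whose string row labels (r1, r2, r3) make it obvious that loc works with labels rather than positions.

```python
import pandas as pd

# Hypothetical frame with string row labels to make the explicit index visible
toy = pd.DataFrame({'Age': [56, 26, 33],
                    'Height': [62.0, 62.0, 66.0]},
                   index=['r1', 'r2', 'r3'])

single_value = toy.loc['r2', 'Age']               # one row label, one column label
one_row = toy.loc['r2']                           # one row, all columns (a Series)
some_columns = toy.loc[:, ['Age']]                # all rows, a list of columns
label_slice = toy.loc['r1':'r2', 'Age':'Height']  # slices include both end labels

print(single_value)
print(one_row)
print(some_columns)
print(label_slice)
```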

In [None]:
df.loc[:,"CWDistance"] #returns all observations of CWDistance

0      79
1      70
2      85
3      87
4      72
5      81
6     107
7      98
8     106
9      65
10     96
11     79
12     92
13     66
14     72
15    115
16     90
17     74
18     64
19     85
20     66
21    101
22     82
23     63
24     67
Name: CWDistance, dtype: int64

In [None]:
# select all rows for multiple columns

df.loc[:, ['CWDistance', 'Height', 'Wingspan']]

,CWDistance,Height,Wingspan
0,79,62.0,61.0
1,70,62.0,60.0
2,85,66.0,64.0
3,87,64.0,63.0
4,72,73.0,75.0
5,81,75.0,71.0
6,107,75.0,76.0
7,98,65.0,62.0
8,106,74.0,73.0
9,65,63.0,60.0


In [None]:
# select the rows labelled 0 through 9 of the above columns
# (the first 10 rows: loc slices include the end label)

df.loc[:9, ['CWDistance', 'Height', 'Wingspan']]

,CWDistance,Height,Wingspan
0,79,62.0,61.0
1,70,62.0,60.0
2,85,66.0,64.0
3,87,64.0,63.0
4,72,73.0,75.0
5,81,75.0,71.0
6,107,75.0,76.0
7,98,65.0,62.0
8,106,74.0,73.0
9,65,63.0,60.0


In [None]:
# select a range of rows (labels 10 through 15, inclusive) for all columns

df.loc[10:15]

,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
10,11,30,M,2,Y,1,69.5,66.0,96,Y,1,6
11,12,28,F,1,Y,1,62.75,58.0,79,Y,1,10
12,13,25,F,1,Y,1,65.0,64.5,92,Y,1,6
13,14,23,F,1,N,0,61.5,57.5,66,Y,1,4
14,15,31,M,2,Y,1,73.0,74.0,72,Y,1,9
15,16,26,M,2,Y,1,71.0,72.0,115,Y,1,6


In [None]:
df.Gender.unique() #apply unique function for column Gender

array(['F', 'M'], dtype=object)

### Summarizing and Computing Descriptive Statistics

The median is often a more robust measure of the center than the mean. It reduces the influence of outliers, which is a
particular problem in mass experiments, where a certain fraction of participants are likely to give careless or unreliable responses.
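A small worked example shows the effect. The ages below are made up for illustration: four participants in their twenties and one much older participant.

```python
import pandas as pd

# Hypothetical ages: mostly mid-twenties, plus one much older participant
ages = pd.Series([22, 24, 26, 27, 56])

mean_age = ages.mean()      # (22 + 24 + 26 + 27 + 56) / 5 = 31.0
median_age = ages.median()  # middle value of the sorted ages = 26.0

# Dropping the single outlier moves the mean far more than the median
without_outlier = ages[ages < 50]
mean_trimmed = without_outlier.mean()      # 99 / 4 = 24.75
median_trimmed = without_outlier.median()  # (24 + 26) / 2 = 25.0

print(mean_age, median_age)
print(mean_trimmed, median_trimmed)
```

One outlier shifts the mean by more than six years, while the median moves by one.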



In [None]:
df['Age'].describe()



count    25.000000
mean     28.240000
std       6.989754
min      22.000000
25%      24.000000
50%      26.000000
75%      29.000000
max      56.000000
Name: Age, dtype: float64

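We can access these individually as follows. On a DataFrame, each call returns one value per column; in recent pandas versions, df.mean() needs numeric_only=True when the frame contains non-numeric columns such as Gender:

df.mean(numeric_only=True)

df.min()

df.max()

df.count()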
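As a sketch of how these calls behave, the example below uses a hypothetical three-row frame named toy with one non-numeric column. It assumes a recent pandas version, where the numeric_only argument must be passed explicitly to skip text columns.

```python
import pandas as pd

# Hypothetical frame mixing numeric and text columns
toy = pd.DataFrame({'Age': [56, 26, 33],
                    'Gender': ['F', 'F', 'M'],
                    'Height': [62.0, 62.0, 66.0]})

# On a single column, each statistic is a scalar
age_mean = toy['Age'].mean()
age_min = toy['Age'].min()
age_max = toy['Age'].max()
age_count = toy['Age'].count()

# On the whole frame, the result is a Series with one entry per column;
# numeric_only=True skips the text column
numeric_means = toy.mean(numeric_only=True)

print(age_mean, age_min, age_max, age_count)
print(numeric_means)
```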

In [None]:
df.sum(axis=1, numeric_only=True)  # row-wise sum over the numeric columns only


0     269.00
1     231.00
2     261.00
3     269.00
4     258.00
5     262.00
6     306.00
7     266.00
8     299.00
9     242.00
10    282.50
11    252.75
12    268.50
13    228.00
14    278.00
15    310.00
16    265.00
17    259.00
18    252.00
19    274.00
20    251.00
21    304.00
22    275.00
23    261.00
24    248.00
dtype: float64

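### Filtering and Sorting


Rows can be sorted with sort_values and filtered with query or with boolean masks. Sorting and filtering keep the original index labels, so afterwards the label-based loc and the position-based iloc no longer refer to the same rows: the iloc attribute allows indexing and slicing that always references the implicit Python-style index (zero-based positions, with the end of a slice excluded), regardless of the labels.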
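The difference between position-based and label-based access is easiest to see once the rows are no longer in label order. A minimal sketch, using a hypothetical one-column frame named toy:

```python
import pandas as pd

toy = pd.DataFrame({'Age': [56, 26, 33, 39]})

# Slicing: iloc excludes the end position, loc includes the end label
by_position = toy.iloc[:2]   # positions 0 and 1
by_label = toy.loc[:2]       # labels 0, 1 and 2

# After sorting, rows keep their original labels: 1, 2, 3, 0
by_age = toy.sort_values(by='Age')

first_by_position = by_age.iloc[0, 0]   # first row as displayed (the youngest)
first_by_label = by_age.loc[0, 'Age']   # the row whose label is 0

print(len(by_position), len(by_label))
print(by_age.index.tolist())
print(first_by_position, first_by_label)
```

If positional order should match the labels again after sorting, reset_index(drop=True) renumbers the rows.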

In [None]:
df.sort_values(by=['ID', 'Age'], ascending=False)

,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
24,25,23,F,1,Y,1,65.0,63.0,67,N,0,3
23,24,26,M,2,N,0,69.0,71.0,63,Y,1,5
22,23,25,M,2,N,0,70.0,68.0,82,Y,1,4
21,22,29,M,2,N,0,71.0,70.0,101,Y,1,8
20,21,23,M,2,Y,1,69.0,67.0,66,N,0,2
19,20,24,F,1,Y,1,68.0,66.0,85,Y,1,8
18,19,23,M,2,Y,1,70.0,69.0,64,Y,1,3
17,18,27,M,2,N,0,66.0,66.0,74,Y,1,5
16,17,26,F,1,N,0,61.5,59.5,90,N,0,10
15,16,26,M,2,Y,1,71.0,72.0,115,Y,1,6


In [None]:
df.query('Age > 12')

,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10
4,5,27,M,2,N,0,73.0,75.0,72,N,0,4
5,6,24,M,2,N,0,75.0,71.0,81,N,0,3
6,7,28,M,2,N,0,75.0,76.0,107,Y,1,10
7,8,22,F,1,N,0,65.0,62.0,98,Y,1,9
8,9,29,M,2,Y,1,74.0,73.0,106,N,0,5
9,10,33,F,1,Y,1,63.0,60.0,65,Y,1,8


In [None]:
# Filtering with multiple clauses: refer to columns either as Python attributes
# (df.Age) or with the df[] syntax (df['Height'])

df[(df.Age > 10) & (df['Height'] > 65)]


,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
4,5,27,M,2,N,0,73.0,75.0,72,N,0,4
5,6,24,M,2,N,0,75.0,71.0,81,N,0,3
6,7,28,M,2,N,0,75.0,76.0,107,Y,1,10
8,9,29,M,2,Y,1,74.0,73.0,106,N,0,5
10,11,30,M,2,Y,1,69.5,66.0,96,Y,1,6
14,15,31,M,2,Y,1,73.0,74.0,72,Y,1,9
15,16,26,M,2,Y,1,71.0,72.0,115,Y,1,6
17,18,27,M,2,N,0,66.0,66.0,74,Y,1,5
18,19,23,M,2,Y,1,70.0,69.0,64,Y,1,3
