# Data Analysis and Manipulation with Pandas

## Pandas Introduction

``Pandas`` is an open source Python library for data analysis. It gives Python the
ability to work with spreadsheet-like data for fast data loading, manipulating,
aligning, merging, etc. To give Python these enhanced features, Pandas
introduces two new data types to Python: ``Series`` and ``DataFrame``. The
DataFrame will represent your entire spreadsheet or rectangular data, whereas
the Series is a single column of the DataFrame. A Pandas DataFrame can also
be thought of as a dictionary or collection of Series.
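
The "dictionary of Series" view can be illustrated with a minimal sketch. The names `heights`, `ages`, and `people` and the values are hypothetical example data, not part of the original material:

```python
import pandas as pd

# Two Series that share the same index labels (hypothetical example data).
heights = pd.Series([1.70, 1.82, 1.65], index=["ann", "bob", "cy"])
ages = pd.Series([31, 45, 27], index=["ann", "bob", "cy"])

# A DataFrame built from a dict of Series: each key becomes a column.
people = pd.DataFrame({"height_m": heights, "age": ages})

# Selecting one column gives back a Series.
age_column = people["age"]
print(type(age_column).__name__)  # Series
print(people.shape)               # (3, 2)
```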

NumPy and its ndarray object provide efficient storage and manipulation
of dense typed arrays in Python. Pandas is a package built on top of NumPy that provides an
efficient implementation of a DataFrame. DataFrames are essentially multidimensional
arrays with attached row and column labels, and often with heterogeneous
types and/or missing data. As well as offering a convenient storage interface for
labeled data, Pandas implements a number of powerful data operations familiar to
users of both database frameworks and spreadsheet programs.

NumPy’s ndarray data structure provides essential features for the type of
clean, well-organized data typically seen in numerical computing tasks. While it
serves this purpose very well, its limitations become clear when we need more flexibility
(attaching labels to data, working with missing data, etc.) and when attempting
operations that do not map well to element-wise broadcasting (groupings, pivots,
etc.), each of which is an important piece of analyzing the less structured data available
in many forms in the world around us. Pandas, and in particular its Series and
DataFrame objects, builds on the NumPy array structure and provides efficient access
to these sorts of “data munging” tasks that occupy much of a data scientist’s time.

Import ``pandas`` under the alias pd and check the version:

In [None]:
import pandas as pd
pd.__version__

***Reminder About Built-In Documentation***

Remember, IPython gives you the ability to quickly explore the contents of a package 
(by using the tab-completion feature) as well as the documentation of various functions 
(using the ? character).

For example, to display all the contents of the pandas namespace, you can type this:

pd.<TAB>

And to display the built-in Pandas documentation, you can use this:

pd?

More detailed documentation, along with tutorials and other resources, can be found
at [Pandas](http://pandas.pydata.org/).

### Pandas Objects

At the very basic level, Pandas objects can be thought of as enhanced versions of
NumPy structured arrays in which the rows and columns are identified with labels
rather than simple integer indices. 

To get started with pandas, we will need to get comfortable with its two workhorse
data structures: Series and DataFrame.

#### Series

A Series is a one-dimensional array-like object containing a sequence of values (of
similar types to NumPy types) and an associated array of data labels, called its index.
The simplest Series is formed from only an array of data. After importing NumPy and
Pandas under the aliases np and pd, we can build one from a plain list:

In [None]:
import numpy as np
import pandas as pd

In [None]:
series_data = pd.Series([0.0, 0.25, 0.5, 0.75, 1.0])
series_data

The Series wraps both a sequence of values and a
sequence of indices, which we can access with the values and index attributes. The
values are simply a familiar NumPy array:

In [None]:
series_data.values

The index is an array-like object of type pd.Index.

In [None]:
series_data.index

As with a NumPy array, data can be accessed by the associated index via the familiar
Python square-bracket notation:

In [None]:
series_data

In [None]:
series_data[2]

In [None]:
series_data[1:5]

**Series as generalized NumPy array**

From what we’ve seen so far, it may look like the Series object is basically interchangeable
with a one-dimensional NumPy array. The essential difference is the presence
of the index: while the NumPy array has an implicitly defined integer index used
to access the values, the Pandas Series has an explicitly defined index associated with
the values.

This explicit index definition gives the Series object additional capabilities. The
index need not be an integer, but can consist of values of any desired
type. For example, if we wish, we can use strings as an index:

In [None]:
gdp_per_capita = pd.Series(
    [113196, 83716, 81151, 77975, 77771, 69687, 67037, 65111, 63987, 2171],
    index=['Luxembourg', 'Switzerland', 'Macau', 'Norway', 'Ireland', 'Qatar',
           'Iceland', 'United States', 'Singapore', 'India'])

In [None]:
gdp_per_capita

In [None]:
gdp_per_capita['Norway']

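In [None]:
# positional access: use iloc, because plain integer keys on a
# label-indexed Series are deprecated in recent pandas versions
gdp_per_capita.iloc[3]

Item access by label works as expected: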

In [None]:
gdp_per_capita['United States']

**Task**

Look up the GDP per capita for `India` and store it in `gdp_India`.

In [None]:
## Write 1 line of code.

#YOUR CODE HERE

print(gdp_India)

#### DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection
of columns, each of which can be a different value type (numeric, string,
boolean, etc.). The DataFrame has both a row and column index; it can be thought of
as a dict of Series all sharing the same index. Under the hood, the data is stored as one
or more two-dimensional blocks rather than a list, dict, or some other collection of
one-dimensional arrays.
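
As a small sketch of the point about per-column value types, the frame below (hypothetical data, with the made-up name `mixed`) holds one string column, one float column, and one boolean column, and reports a separate dtype for each:

```python
import pandas as pd

# Hypothetical data: one column per value type.
mixed = pd.DataFrame({
    "name": ["ann", "bob"],
    "score": [9.5, 7.0],
    "passed": [True, False],
})

# Each column keeps its own dtype; rows and columns both carry an index.
print(mixed.dtypes)
print(mixed.index.tolist())    # [0, 1]
print(mixed.columns.tolist())  # ['name', 'score', 'passed']
```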

In [None]:
import pandas as pd

In [None]:
# How to create a DataFrame from a dictionary.
# 'NAME' is not a key in dict1, so that column is filled with NaN.
dict1 = {'name':['john'],'age':[20]}
df1 = pd.DataFrame(dict1,columns=['name','NAME','age'])
df1

In [None]:
df1.columns = ['name1', 'NAME1', 'age1']
df1

In [None]:
# How to create a DataFrame from a list of rows
list1 = [['john',20],['xavier',30]]
pd.DataFrame(list1,columns=['name','age'],index=['R1','R2'])

In [None]:
# How to create a DataFrame from a dict of equal-length lists

stu_name = ['john','xavier','chris','kristien']
age = [18,19,20,18]
stu_id = [101,102,103,104]

stu_data = {'stu_name':stu_name,'age':age,'stu_id':stu_id}
pd.DataFrame(stu_data)

There are many ways to construct a DataFrame, though one of the most common is
from a dict of equal-length lists or NumPy arrays:

### Indexing

In [None]:
dense_pop_data = {'country': ['India', 'Pakistan', 'Bangladesh', 'Japan', 'Philippines', 'Vietnam', 'United Kingdom', 'South Korea', 'Taiwan', 'Sri Lanka'],
'population': [1360780000, 219210000, 168410000, 126010000, 108510000, 96208984, 66435600, 51780579, 23604265, 21803000],
'area_in_sq_km': [3287240, 803940, 143998, 377873, 300000, 331689, 243610, 99538, 36193, 65610],
'density_pop_per_sq_km':[414, 273, 1170, 333, 362, 290, 273, 520, 652, 332],
'notes':['Growing population', 'Growing population', 'Rapidly growing population', 'Declining population', 'Growing population', 
         'Growing population', 'Steady population', 'Steady population', 'Steady population', 'Growing population']}

In [None]:
dense_df = pd.DataFrame(dense_pop_data)
dense_df

The resulting DataFrame will have its index assigned automatically as with Series.

A column in a DataFrame can be retrieved as a Series either by dict-like notation or
by attribute:

In [None]:
dense_df['population']

In [None]:
dense_df.area_in_sq_km

``df[column]`` works for any column name, but ``df.column`` only works when the column name is a valid Python variable
name.
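
The difference between the two notations can be sketched as follows. The column names here are hypothetical and chosen to trigger the problem: one contains spaces, and one (`count`) collides with an existing DataFrame method:

```python
import pandas as pd

# Hypothetical columns: one name contains spaces, one collides with a DataFrame method.
df = pd.DataFrame({"area in sq km": [10, 20], "count": [1, 2], "density": [3.0, 4.0]})

# Bracket notation works for every column name.
area = df["area in sq km"]
counts = df["count"]

# Attribute access is fine for a plain identifier ...
density = df.density

# ... but df.count is the DataFrame.count method, not the "count" column.
print(callable(df.count))  # True
```

For this reason, bracket notation is usually the safer default, especially when assigning to columns.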

### Data Indexing and Selection

#### Series
##### Data Selection in Series

In [None]:
series_data = pd.Series([0.0, 0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd', 'e'])
print(series_data)

In [None]:
# numpyarray = np.array(series_data)
# numpyarray[1]


In [None]:
dict(series_data)

In [None]:
series_data['b']

We can also use dictionary-like Python expressions and methods to examine the
keys/indices and values:

In [None]:
'a' in series_data

In [None]:
series_data.keys()

In [None]:
list(series_data.items())

Series objects can even be modified with a dictionary-like syntax. Just as you can
extend a dictionary by assigning to a new key, you can extend a Series by assigning
to a new index value:

In [None]:
series_data['f'] = 1.25

In [None]:
series_data

In [None]:
# slicing by explicit index
series_data['a':'c']

In [None]:
# slicing by implicit integer index
series_data[0:2]

In [None]:
# masking
series_data[(series_data > 0.3) & (series_data < 0.8)]

In [None]:
# fancy indexing
series_data[['a', 'e']]

*Notice that when you are slicing with an explicit index (e.g., series_data['a':'c']), the final index is included in the slice, while when you’re slicing with an implicit index (e.g., series_data[0:2]), the final index is excluded from the slice.*

##### Indexers: loc and iloc

These slicing and indexing conventions can be a source of confusion. For example, if
your Series has an explicit integer index, an indexing operation such as data[1] will
use the explicit indices, while a slicing operation like data[1:3] will use the implicit
Python-style index.

In [None]:
import pandas as pd

data = pd.Series(['a', 'b', 'c', 'd', 'e'], index = [1, 3, 5, 7, 9])
data

In [None]:
# explicit index when indexing
data[1]

In [None]:
# implicit index when slicing
data[1:3]

Because of this potential confusion in the case of integer indexes, Pandas provides
some special indexer attributes that explicitly expose certain indexing schemes. These
are not functional methods, but attributes that expose a particular slicing interface to
the data in the Series.

First, the ``loc`` attribute allows indexing and slicing that always references the explicit
index:

In [None]:
data.loc[1]

In [None]:
data.loc[1:5]

The ``iloc`` attribute allows indexing and slicing that always references the implicit
Python-style index:

In [None]:
data.iloc[1]

In [None]:
data.iloc[1:3]

**Task**

Store the first 4 values of the series `data` in `part_series`.

In [None]:
## Write 1 line of code.

# YOUR CODE HERE

print(part_series)

In [None]:
part_series_sol = data.iloc[0:4]

**Task**

Add a new data item `f` to the series at index `11`

In [None]:
# YOUR CODE HERE
data.loc[11] = 'f'
print(data)

In [None]:
assert data.loc[11] == 'f'

#### Data Selection in DataFrame

In [None]:
dense_pop_data = {'country': ['India', 'Pakistan', 'Bangladesh', 'Japan', 'Philippines', 'Vietnam', 'United Kingdom', 'South Korea', 'Taiwan', 'Sri Lanka'],
'population': [1360780000, 219210000, 168410000, 126010000, 108510000, 96208984, 66435600, 51780579, 23604265, 21803000],
'area_in_sq_km': [3287240, 803940, 143998, 377873, 300000, 331689, 243610, 99538, 36193, 65610],
'density_pop_per_sq_km':[414, 273, 1170, 333, 362, 290, 273, 520, 652, 332],
'notes':['Growing population', 'Growing population', 'Rapidly growing population', 'Declining population', 'Growing population', 
         'Growing population', 'Steady population', 'Steady population', 'Steady population', 'Growing population']}

In [None]:
dense_df = pd.DataFrame(dense_pop_data)

The individual Series that make up the columns of the DataFrame can be accessed
via dictionary-style indexing of the column name:

In [None]:
dense_df['country']

Equivalently, we can use attribute-style access with column names that are strings:

In [None]:
dense_df.country

**DataFrame as two-dimensional array**

A DataFrame can also be viewed as an enhanced two-dimensional array. We can examine the raw underlying data array using the values
attribute:

In [None]:
dense_df.values

Transpose the full DataFrame to swap rows and columns:

In [None]:
dense_df.T

Using plain square brackets with a slice selects rows rather than columns:

In [None]:
dense_df

In [None]:
dense_df[1:2]

**Selection with ``loc`` and ``iloc``**

The special indexing operators ``loc`` and ``iloc`` enable us to select a subset of
the rows and columns from a DataFrame with NumPy-like notation, using either
axis labels (``loc``) or integer positions (``iloc``).

In [None]:
dense_df.loc[1:4,['country', 'notes']]

In [None]:
dense_df.iloc[1:4,0:2]

In [None]:
dense_df.iloc[[1, 0], [0, 2, 1]]

In [None]:
dense_df.iloc[2, 1]

In [None]:
dense_df.iloc[:, :3][dense_df.density_pop_per_sq_km > 400]

**Indexing operations with DataFrames**

- ``df[val]``: Selects single column or sequence of columns from the DataFrame; special case
conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame
- ``df.loc[val]``: Selects single row or subset of rows from the DataFrame by label
- ``df.loc[:, val]``: Selects single column or subset of columns by label
- ``df.loc[val1, val2]``: Selects both rows and columns by label
- ``df.iloc[where]``: Selects single row or subset of rows from the DataFrame by integer position
- ``df.iloc[:, where]``: Selects single column or subset of columns by integer position
- ``df.iloc[where_i, where_j]``: Selects both rows and columns by integer position
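
The operations in this list can be sketched one by one on a small frame. The frame `df`, its row labels, and its values are hypothetical and exist only for illustration:

```python
import pandas as pd

# Hypothetical three-row frame with string row labels.
df = pd.DataFrame(
    {"population": [100, 200, 300], "area": [10, 40, 30]},
    index=["x", "y", "z"],
)

one_column = df["population"]        # df[val]: single column as a Series
filtered = df[df["area"] > 20]       # df[val] with a boolean array: filter rows
row_by_label = df.loc["y"]           # df.loc[val]: single row by label
col_by_label = df.loc[:, "area"]     # df.loc[:, val]: single column by label
cell_by_label = df.loc["z", "area"]  # df.loc[val1, val2]: row and column by label
row_by_pos = df.iloc[0]              # df.iloc[where]: single row by position
col_by_pos = df.iloc[:, 1]           # df.iloc[:, where]: single column by position
cell_by_pos = df.iloc[2, 0]          # df.iloc[where_i, where_j]: row and column by position

print(filtered.index.tolist())       # ['y', 'z']
print(cell_by_label, cell_by_pos)    # 30 300
```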

The ``iloc`` indexer is used to select rows and columns by integer position.

You can pass it:
- A single integer position
- A list of integer positions
- A slice of integer positions
- A colon (which indicates "all integer positions")

The ``loc`` indexer is used to select rows and columns by label.

You can pass it:
- A single label
- A list of labels
- A slice of labels
- A boolean Series
- A colon (which indicates "all labels")
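
As a brief sketch of two of these inputs, the example below passes a boolean Series and a list of labels to ``loc``, and then contrasts a label slice with a position slice. The frame and its values are hypothetical:

```python
import pandas as pd

# Hypothetical frame with the default integer row labels 0..3.
df = pd.DataFrame(
    {"country": ["A", "B", "C", "D"], "density": [414, 273, 1170, 333]}
)

# loc accepts a boolean Series for the rows and a list of labels for the columns.
dense = df.loc[df["density"] > 400, ["country", "density"]]

# A label slice in loc includes its end point; a position slice in iloc does not.
by_label = df.loc[1:3]      # rows labelled 1, 2 and 3
by_position = df.iloc[1:3]  # rows at positions 1 and 2

print(dense["country"].tolist())        # ['A', 'C']
print(len(by_label), len(by_position))  # 3 2
```

Selecting rows and columns in a single ``loc`` call, as here, avoids chaining two separate indexing steps.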

### Indexing

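An important method on pandas objects is ``reindex``, which creates a new
object with the data conformed to a new index. The cells below use the related
``set_index`` and ``sort_index`` methods to change and order the row labels.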
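
A minimal sketch of ``reindex`` on a Series follows. The Series `s` and its values are hypothetical; ``fill_value`` is an existing ``reindex`` parameter:

```python
import pandas as pd

s = pd.Series([4.5, 7.2, -5.3], index=["d", "b", "a"])

# reindex returns a new Series arranged by the requested labels;
# labels missing from the original ("c") get NaN.
reordered = s.reindex(["a", "b", "c", "d"])

# fill_value replaces the NaN that would otherwise mark missing labels.
filled = s.reindex(["a", "b", "c", "d"], fill_value=0.0)

print(reordered.isna().tolist())  # [False, False, True, False]
print(filled["c"])                # 0.0
```

The original Series `s` is left unchanged, since ``reindex`` builds a new object.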

In [None]:
dense_df

In [None]:
# Set country as index
dense_df.set_index('country', inplace = True)

In [None]:
dense_df

In [None]:
dense_df.loc['United Kingdom']

In [None]:
dense_df.iloc[0]

In [None]:
dense_df.loc['Japan':'Sri Lanka']

In [None]:
dense_df[0:5]

In [None]:
dense_df

In [None]:
dense_df = dense_df.sort_index()
dense_df

In [None]:
dense_df.sort_values(['population'])

## Merge and concat

In [None]:
import numpy as np

In [None]:
df1 = pd.DataFrame({
'A':[1,2,3,4],
'B':[True,False,True,True],
'C':['C1','C2','C3','C4']
})
df2 = pd.DataFrame({
'A':[5,7,8,5],
'B':[False,False,True,False],
'D':['D1','D2','D3','D4']
})

In [None]:
print(df1)
print(df2)

### Concat

In [None]:
pd.concat([df1,df2],axis=0)

In [None]:
pd.concat([df1,df2],axis=1)

In [None]:
pd.concat([df1,df2],axis=0,ignore_index=True)

In [None]:
df1 = pd.DataFrame({
'A':[1,2,3,4],
'B':[True,False,True,True],
'C':['C1','C2','C3','C4']
})
df2 = pd.DataFrame({
'A':[5,7,8,'NULL'],
'B':[False,False,True,False],
'D':['D1','D2','D3','NULL']
})

In [None]:

print(df1)
print(df2)

In [None]:
pd.concat([df1,df2],axis=1)

In [None]:
pd.concat([df1,df2],axis=0)

### Merge

In [None]:
df1 = pd.DataFrame({
'A':[1,2,3,4],
'B':[True,False,True,True],
'C':['C1','C2','C3','C4']
})
df2 = pd.DataFrame({
'A':[4,7,8,'NULL'],
'B':[True,False,True,False],
'D':['D1','D2','D3','NULL']
})

In [None]:
print(df1)
print(df2)

In [None]:
?pd.merge
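
As a rough sketch of what ``pd.merge`` does, the example below joins two small frames on a shared key column. The frames `left` and `right` are hypothetical stand-ins with clean integer keys, not the `df1` and `df2` defined above; ``on`` and ``how`` are existing ``pd.merge`` parameters:

```python
import pandas as pd

# Hypothetical frames sharing the columns "A" and "B".
left = pd.DataFrame({"A": [1, 2, 3, 4],
                     "B": [True, False, True, True],
                     "C": ["C1", "C2", "C3", "C4"]})
right = pd.DataFrame({"A": [4, 7, 8],
                      "B": [True, False, True],
                      "D": ["D1", "D2", "D3"]})

# Inner join on column "A": only A == 4 appears in both frames.
# The overlapping non-key column "B" gets the default suffixes _x and _y.
inner = pd.merge(left, right, on="A", how="inner")

# Left join keeps every row of `left`; unmatched rows get NaN in "D".
left_join = pd.merge(left, right[["A", "D"]], on="A", how="left")

print(inner.columns.tolist())      # ['A', 'B_x', 'C', 'B_y', 'D']
print(len(inner), len(left_join))  # 1 4
```

Unlike ``pd.concat``, which stacks frames along an axis, ``pd.merge`` matches rows by the values in the key column.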