# PANDAS 

## Series 

In [1]:
import pandas as pd

In [2]:
obj = pd.Series([4, 7, -5, 3]);obj

0    4
1    7
2   -5
3    3
dtype: int64

A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index.

You can get the array representation and index object of the Series via its values and index attributes, respectively:

In [3]:
obj.values

array([ 4,  7, -5,  3])

In [4]:
obj.index

RangeIndex(start=0, stop=4, step=1)

We can also assign our own index labels to the data points:

In [5]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [6]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

You can use labels in the index when selecting single values or a set of values:

In [7]:
obj2['a']

-5

In [8]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    4
dtype: int64

Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

In [9]:
obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64

In [10]:
obj2 > 0

d     True
b     True
a    False
c     True
dtype: bool

In [11]:
obj2*2

d     8
b    14
a   -10
c     6
dtype: int64
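As a minimal sketch of the point above, the snippet below rebuilds the same Series under the illustrative name `s` and applies a NumPy function and a boolean filter. In both cases the result is still a Series whose labels stay attached to their values.

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# A NumPy function applied to a Series returns a Series with the same index.
absolute = np.abs(s)

# Filtering with a boolean Series keeps the labels of the surviving elements.
positive = s[s > 0]

print(absolute)
print(positive.index.tolist())
```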

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. 
<br> It can be used in many contexts where you might use a dict:

In [12]:
 'b' in obj2

True

In [13]:
 'e' in obj2

False

If you have data in a Python dict, you can create a Series from it by passing the dict:

In [14]:
 sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [15]:
obj3 = pd.Series(sdata)

In [16]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When you pass only a dict, the index of the resulting Series contains the dict's keys in insertion order, as the output above shows (older pandas versions sorted the keys instead). 
<br> You can override this by passing an index with the labels in the order you want them to appear in the resulting Series:

In [17]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [18]:
obj4 = pd.Series(sdata, index=states);obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Here, three values found in sdata were placed in the appropriate locations, but since no value for 'California' was found, it appears as NaN (not a number), which is considered in pandas to mark missing or NA values. 
<br> Since 'Utah' was not included in states, it is excluded from the resulting object.
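The following sketch repeats the construction above and checks both effects directly: the label that is missing from the dict becomes NaN, and the dict key that is missing from the requested index is dropped. The variable name `by_request` is only illustrative.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

by_request = pd.Series(sdata, index=states)

# 'California' has no entry in sdata, so its value is missing (NaN).
print(by_request.isnull())

# 'Utah' was not requested, so it does not appear in the result.
print('Utah' in by_request.index)
```

Because NaN is a floating-point value, the resulting Series is stored as float64 even though every value in the dict is an integer.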

The isnull and notnull functions in pandas should be used to detect missing data:

In [19]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [20]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Series also has these as instance methods:

In [21]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

Both the Series object itself and its index have a name attribute, which other parts of pandas make use of:

In [22]:
obj4.name = 'population'

In [23]:
obj4.index.name = 'state'

In [24]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

## DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).

To create a DataFrame, call the pd.DataFrame() constructor:

In [25]:
df=pd.DataFrame([[1,2,3],[4,5,6]]);df

Unnamed: 0,0,1,2
0,1,2,3
1,4,5,6


Since we did not explicitly set the index and columns, pandas labeled both with integers starting from 0. We can pass our own labels instead:

In [26]:
df=pd.DataFrame([[1,2,3],[4,5,6]],index=["a","b"],columns=list("klm"));df

Unnamed: 0,k,l,m
a,1,2,3
b,4,5,6


There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:

In [27]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
            'year': [2000, 2001, 2002, 2001, 2002, 2003],
            'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [28]:
data

{'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
 'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002, 2003]}

In [29]:
frame = pd.DataFrame(data);frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


# Getting columns of a dataframe

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:

In [30]:
frame["state"]

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

In [31]:
frame.state

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

Exercise: get the year column.

frame[column] works for any column name, but frame.column only works when the column name is a valid Python variable name.
<br> For example, if there is a space in the column name, frame.column will not work.
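A small sketch of the difference follows. The column name `pop density` is a hypothetical addition used only to show a name with a space. The snippet also illustrates a related pitfall: a column called `pop` shares its name with the DataFrame method `pop`, so attribute access returns the method rather than the column.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'pop': [1.5, 2.4],
                      'pop density': [10.0, 20.0]})  # hypothetical column

# Attribute access works for a plain identifier such as 'state'.
print(frame.state.tolist())

# Bracket access works for any name, including one containing a space.
print(frame['pop density'].tolist())

# 'pop' is also a DataFrame method, so frame.pop is the method, not the column.
print(callable(frame.pop))
print(frame['pop'].tolist())
```

For this reason bracket notation is the safer default in scripts.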

Columns can be modified by assignment. Assigning a scalar sets the same value in every row:

In [32]:
frame['debt'] = 16.5 ; frame

Unnamed: 0,state,year,pop,debt
0,Ohio,2000,1.5,16.5
1,Ohio,2001,1.7,16.5
2,Ohio,2002,3.6,16.5
3,Nevada,2001,2.4,16.5
4,Nevada,2002,2.9,16.5
5,Nevada,2003,3.2,16.5


When you are assigning lists or arrays to a column, the value’s length must match the length of the DataFrame

In [33]:
import numpy as np

In [34]:
frame['debt'] = np.arange(10,16) ; frame

Unnamed: 0,state,year,pop,debt
0,Ohio,2000,1.5,10
1,Ohio,2001,1.7,11
2,Ohio,2002,3.6,12
3,Nevada,2001,2.4,13
4,Nevada,2002,2.9,14
5,Nevada,2003,3.2,15


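Try the script above with np.arange(10, 17): it raises a ValueError, because the array has 7 elements while the DataFrame has 6 rows.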
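The length rule can be checked with a short sketch like the one below, which rebuilds a six-row frame and catches the error raised by a seven-element array. The variable name `error` is only illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio'] * 3 + ['Nevada'] * 3,
                      'year': [2000, 2001, 2002, 2001, 2002, 2003]})

# Seven values for six rows: pandas refuses the assignment.
try:
    frame['debt'] = np.arange(10, 17)
except ValueError as exc:
    error = exc
else:
    error = None
print(type(error).__name__)

# Six values for six rows: the assignment succeeds.
frame['debt'] = np.arange(10, 16)
print(frame['debt'].tolist())
```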

If you assign a Series, its labels will be realigned exactly to the DataFrame’s index, inserting missing values in any holes:

In [35]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five']); val

two    -1.2
four   -1.5
five   -1.7
dtype: float64

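First, create a DataFrame with a labeled index and name it frame2. The debt column is requested in columns but is not present in data, so it is filled with missing values: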

In [36]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                          index=['one', 'two', 'three', 'four',
                                     'five', 'six'])

In [37]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


Exercise: assign np.arange(1, 7) to the debt column.

Exercise: now assign the val Series to the debt column.

In [38]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,
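The label alignment described earlier can be sketched as follows. The frame is rebuilt here without the debt column so that the snippet runs on its own; after the assignment, only the rows whose labels appear in val receive a value and the remaining rows hold NaN.

```python
import pandas as pd

frame2 = pd.DataFrame({'year': [2000, 2001, 2002, 2001, 2002, 2003],
                       'state': ['Ohio'] * 3 + ['Nevada'] * 3,
                       'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]},
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

# Values are matched by label, not by position.
frame2['debt'] = val
print(frame2['debt'])
```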


The del keyword will delete columns as with a dict.

In [39]:
del frame2["debt"];frame2

Unnamed: 0,year,state,pop
one,2000,Ohio,1.5
two,2001,Ohio,1.7
three,2002,Ohio,3.6
four,2001,Nevada,2.4
five,2002,Nevada,2.9
six,2003,Nevada,3.2


To create a new column, assign to a column name that does not exist yet:

In [40]:
frame2["size"]=np.arange(100,700,100);frame2

Unnamed: 0,year,state,pop,size
one,2000,Ohio,1.5,100
two,2001,Ohio,1.7,200
three,2002,Ohio,3.6,300
four,2001,Nevada,2.4,400
five,2002,Nevada,2.9,500
six,2003,Nevada,3.2,600


frame2.columns and frame2.index return the column labels and the row labels, and frame2.values returns the data as a two-dimensional NumPy array:

In [41]:
frame2.columns 

Index(['year', 'state', 'pop', 'size'], dtype='object')

In [42]:
frame2.index

Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

In [43]:
frame2.values

array([[2000, 'Ohio', 1.5, 100],
       [2001, 'Ohio', 1.7, 200],
       [2002, 'Ohio', 3.6, 300],
       [2001, 'Nevada', 2.4, 400],
       [2002, 'Nevada', 2.9, 500],
       [2003, 'Nevada', 3.2, 600]], dtype=object)

If the DataFrame’s columns are different dtypes, the dtype of the values array will be chosen to accommodate all of the columns, which is why the array above has dtype=object.
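A minimal sketch of this behavior, using two small hypothetical frames: when every column is numeric the array gets a common numeric dtype, and when numbers are mixed with strings the array falls back to object.

```python
import numpy as np
import pandas as pd

# Integer and float columns can share a float64 array.
numeric = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5]})
print(numeric.values.dtype)

# Numbers and strings have no common numeric type, so the array is object.
mixed = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
print(mixed.values.dtype)
```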

## Possible data inputs to DataFrame constructor

![Screen%20Shot%202018-11-09%20at%2018.15.54.png](attachment:Screen%20Shot%202018-11-09%20at%2018.15.54.png)

## Reindexing

An important method on pandas objects is reindex, which means to create a new object with the data conformed to a new index.

In [44]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c']);obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [45]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e']);obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

With DataFrame, reindex can alter either the (row) index, columns, or both. 
<br> When passed only a sequence, it reindexes the rows in the result:

In [46]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                             index=['a', 'c', 'd'],
                            columns=['Ohio', 'Texas', 'California'])

In [47]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [48]:
frame2 = frame.reindex(['a', 'b', 'c', 'd']); frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


The columns can be reindexed with the columns keyword:

In [49]:
states = ['Texas', 'Utah', 'California']

In [50]:
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8
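Rows and columns can also be reindexed in a single call by passing both the index and columns keywords. The sketch below combines the two examples above; the name `both` is only illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# New row 'b' and new column 'Utah' are filled with NaN.
both = frame.reindex(index=['a', 'b', 'c', 'd'],
                     columns=['Texas', 'Utah', 'California'])
print(both)
```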


## Dropping Entries from an Axis

Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. 
<br> The drop method returns a new object with the indicated value or values deleted from an axis:

In [51]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e']);obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [52]:
 new_obj = obj.drop('c') ; new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [53]:
 data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                           index=['Ohio', 'Colorado', 'Utah', 'New York'],
                           columns=['one', 'two', 'three', 'four'])


In [54]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [55]:
data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


You can drop values from the columns by passing axis=1 or axis='columns':

In [56]:
data.drop('two', axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


Exercise: drop the columns "one" and "four".

Many functions, like drop, which modify the size or shape of a Series or DataFrame, can manipulate an object in-place without returning a new object:

In [57]:
 data.drop(['two', 'three'], axis='columns',inplace=True)

In [58]:
data

Unnamed: 0,one,four
Ohio,0,3
Colorado,4,7
Utah,8,11
New York,12,15
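The difference between the two calling styles can be sketched as below. Without inplace, drop returns a new object and leaves the original alone; with inplace=True it modifies the original and returns None, so the result should not be assigned back. The names `trimmed` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Default: a new DataFrame is returned and data keeps all four columns.
trimmed = data.drop(['two', 'three'], axis='columns')
print(list(data.columns))

# In-place: data itself changes and the call returns None.
result = data.drop(['two', 'three'], axis='columns', inplace=True)
print(result)
print(list(data.columns))
```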


## Indexing, Selection, and Filtering

In [59]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                           index=['Ohio', 'Colorado', 'Utah', 'New York'],
                           columns=['one', 'two', 'three', 'four'])

In [60]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [61]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [62]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


The row selection syntax data[:2] is provided as a convenience. 
<br> Passing a single element or a list to the [] operator selects columns.

## Boolean indexing

In [63]:
data['three'] > 5

Ohio        False
Colorado     True
Utah         True
New York     True
Name: three, dtype: bool

In [64]:
data[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


## Selection with loc and iloc

We can select a subset of rows and columns with two special indexing attributes:
<br> iloc: selects by integer position
<br> loc: selects by axis label

In [65]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [66]:
 data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int64

In [67]:
 data.loc['Colorado']

one      4
two      5
three    6
four     7
Name: Colorado, dtype: int64

In [68]:
 data.loc[['Colorado',"Utah"]]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11


In [69]:
 data.loc[:,['two',"four"]]

Unnamed: 0,two,four
Ohio,1,3
Colorado,5,7
Utah,9,11
New York,13,15


Exercise: get the Ohio and Utah rows.

Exercise: get only the "one" and "three" columns.

Exercise: get the Ohio and Utah rows and the "one" and "three" columns.

Exercise: get the first two rows.

Exercise: get the last two columns.

Exercise: get the first two rows and the last two columns.
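One possible set of answers is sketched below, using loc for the label-based selections and iloc for the position-based ones. The variable names are illustrative and other spellings work as well.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Label-based selection with loc.
rows = data.loc[['Ohio', 'Utah']]
cols = data.loc[:, ['one', 'three']]
rows_and_cols = data.loc[['Ohio', 'Utah'], ['one', 'three']]

# Position-based selection with iloc.
first_two_rows = data.iloc[:2]
last_two_cols = data.iloc[:, -2:]
corner = data.iloc[:2, -2:]

print(rows_and_cols)
print(corner)
```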

# Function Application and Mapping

frame.apply() applies a function that operates on one-dimensional arrays to each column (the default) or each row (axis="columns" or axis=1):

In [70]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                         index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [71]:
frame

Unnamed: 0,b,d,e
Utah,0.662167,-0.198459,0.639036
Ohio,-1.006596,1.226871,2.024934
Texas,0.644239,1.616828,-0.692944
Oregon,0.086828,0.055211,0.861726


In [72]:
f = lambda x: x.max() - x.min()

In [73]:
frame.apply(f)

b    1.668762
d    1.815287
e    2.717878
dtype: float64

In [74]:
frame.apply(lambda x: x.max() - x.min())

b    1.668762
d    1.815287
e    2.717878
dtype: float64

In [75]:
frame.apply(f,axis="columns")

Utah      0.860625
Ohio      3.031530
Texas     2.309772
Oregon    0.806515
dtype: float64

In [76]:
frame.apply(f,axis=1)

Utah      0.860625
Ohio      3.031530
Texas     2.309772
Oregon    0.806515
dtype: float64

### The function passed to apply need not return a scalar value; it can also return a Series with multiple values:

In [77]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

In [78]:
frame.apply(f)

Unnamed: 0,b,d,e
min,-1.006596,-0.198459,-0.692944
max,0.662167,1.616828,2.024934


## To apply a function to every element, use the applymap() function (deprecated since pandas 2.1 in favor of DataFrame.map)

In [79]:
myfunct=lambda x: x**2

In [80]:
frame.applymap(myfunct)

Unnamed: 0,b,d,e
Utah,0.438464,0.039386,0.408367
Ohio,1.013235,1.505212,4.100358
Texas,0.415044,2.614134,0.480171
Oregon,0.007539,0.003048,0.742571


## To apply a function element-wise on a Series, use the map() method

In [81]:
 frame['e'].map(myfunct)

Utah      0.408367
Ohio      4.100358
Texas     0.480171
Oregon    0.742571
Name: e, dtype: float64
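The sketch below contrasts the column-wise, row-wise, and element-wise cases on a small frame with fixed values, so the results are easy to verify by hand. The element-wise step over the whole frame is written as apply plus Series.map, which is an assumption chosen here to stay independent of the pandas version; it is not the only way to do it.

```python
import pandas as pd

frame = pd.DataFrame([[1.0, -2.0], [3.0, 4.0]],
                     index=['Utah', 'Ohio'], columns=['b', 'd'])

def spread(x):
    """Range of a one-dimensional array: max minus min."""
    return x.max() - x.min()

def square(x):
    """Square a single value."""
    return x ** 2

col_range = frame.apply(spread)                  # one result per column
row_range = frame.apply(spread, axis='columns')  # one result per row
squared_col = frame['b'].map(square)             # element-wise on a Series
squared_all = frame.apply(lambda col: col.map(square))  # element-wise on all

print(col_range)
print(row_range)
print(squared_all)
```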

# Sorting

## Sorting index 

In [88]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

In [89]:
obj

d    0
a    1
b    2
c    3
dtype: int64

In [84]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

Try sorting an index with mixed data types (strings and integers):

In [85]:
obj = pd.Series(range(8), index=['d', 'a', 'b', 'c',4,2,3,1])

In [86]:
obj

d    0
a    1
b    2
c    3
4    4
2    5
3    6
1    7
dtype: int64

In [87]:
obj.sort_index()

TypeError: ignored
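The error arises because Python cannot compare a str with an int. One possible workaround, sketched here as an assumption rather than a recommendation, is to convert every label to a string before sorting. Note that the labels are then ordered as text, so digits sort before letters.

```python
import pandas as pd

obj = pd.Series(range(8), index=['d', 'a', 'b', 'c', 4, 2, 3, 1])

# Work on a copy so the original mixed-type index is left untouched.
as_text = obj.copy()
as_text.index = obj.index.map(str)

sorted_obj = as_text.sort_index()
print(sorted_obj)
```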

In [None]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                            index=['three', 'one'],
                         columns=['d', 'a', 'b', 'c'])

In [None]:
frame

In [None]:
frame.sort_index()

In [None]:
frame.sort_index(axis=1)

In [None]:
frame

By default, sort_index returns a new sorted object and leaves the original unchanged. To sort in place, pass inplace=True:

In [None]:
frame.sort_index(axis=1,inplace=True)

In [None]:
frame

## Sorting by values 

In [None]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

In [None]:
frame

In [None]:
frame.sort_values(by="a")

To sort by multiple columns, pass a list of names:

In [None]:
frame.sort_values(by=['a', 'b'])
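To make the multi-column case concrete, the sketch below sorts the same frame and prints the result. Rows are ordered by a first, and rows with equal a are then ordered by b. The original row labels travel with their rows, and the original frame is left unchanged.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

by_both = frame.sort_values(by=['a', 'b'])
print(by_both)

# The row labels show where each row came from.
print(by_both.index.tolist())
```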

## Axis Indexes with Duplicate Labels

In [None]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

In [None]:
obj

The index’s is_unique property can tell you whether its labels are unique or not:

In [None]:
obj.index.is_unique

Indexing a label with multiple entries returns a Series, while single entries return a scalar value:

In [None]:
obj['a']

In [None]:
obj['c']

This can make your code more complicated, as the output type from indexing can
vary based on whether a label is repeated or not.

In [None]:
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])

In [None]:
df

In [None]:
df.loc['b']
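The varying return type can be seen in a short sketch. A label that occurs twice yields a Series, while a label that occurs once yields a scalar, so code that must handle both cases may need an explicit type check.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

print(obj.index.is_unique)

repeated = obj['a']   # two entries carry this label
single = obj['c']     # exactly one entry carries this label

print(type(repeated).__name__)
print(single)
```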

# Descriptive Statistics

In [None]:
df = pd.DataFrame([[2.0, np.nan], [7, -4], [np.nan, np.nan], [1, -2]],
index=['a', 'b', 'c', 'd'],columns=['one', 'two'])

In [None]:
df

Calling DataFrame’s sum method returns a Series containing column sums:

In [None]:
df.sum()

Passing axis='columns' or axis=1 sums across the columns instead:

In [None]:
df.sum(axis=1)

NA values are excluded unless the entire slice (row or column in this case) is NA. 
<br> This can be disabled with the skipna option:

In [None]:
 df.mean(axis='columns', skipna=False)
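The effect of skipna can be checked on the same frame. In the sketch below, the row means are computed twice: once with the default behavior, which ignores NaN, and once with skipna=False, where any NaN in a row makes that row's mean NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[2.0, np.nan], [7, -4], [np.nan, np.nan], [1, -2]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

col_sum = df.sum()                                 # NaN values are skipped
row_mean_skip = df.mean(axis='columns')            # default: skipna=True
row_mean_strict = df.mean(axis='columns', skipna=False)

print(col_sum)
print(row_mean_skip)
print(row_mean_strict)
```

Row a has one valid value, so its default mean is 2.0, but with skipna=False it becomes NaN. Row c has no valid values, so its mean is NaN either way.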

The describe method produces multiple summary statistics in one shot:

In [None]:
df.describe()

### Descriptive and summary statistics

<div>
<img src="attachment:Screen%20Shot%202018-11-11%20at%2021.27.48.png" width="600" height="450" >
</div>

Example for pct_change:

In [None]:
df = pd.DataFrame({
    'FR': [4.0405, 4.0963, 4.3149],
    'GR': [1.7246, 1.7482, 1.8519],
    'IT': [804.74, 810.01, 860.13]},
    index=['1980-01-01', '1980-02-01', '1980-03-01'])

In [None]:
df

In [None]:
df.pct_change()

The periods argument sets how many rows to shift back when forming the percent change:

In [None]:
df.pct_change(periods=2)
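As a rough illustration of what pct_change computes, the sketch below compares it with the explicit formula, current value divided by the value one period earlier, minus 1. The Series `s` holds made-up round numbers so that the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

s = pd.Series([100.0, 110.0, 121.0])  # illustrative values

pct = s.pct_change()           # change relative to the previous row
manual = s / s.shift(1) - 1    # the same calculation written out
two_back = s.pct_change(periods=2)  # change relative to two rows earlier

print(pct)
print(two_back)
```

The first row has nothing to compare against, so it is NaN. With periods=2 the first two rows are NaN.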

In [None]:
np.random.seed(1)
df=pd.DataFrame(np.random.randint(5,10,(4,4)),index=list("abcd"),columns=list("xyzt"));df

In [None]:
df.diff()

In [None]:
df.diff(axis="columns")

# Unique Values, Value Counts, and Membership

In [None]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c']);obj

In [None]:
uniques = obj.unique()

In [None]:
uniques

In [None]:
obj.value_counts()

isin performs a vectorized set membership check and can be useful in filtering a dataset down to a subset of values in a Series or column in a DataFrame:

In [None]:
mask = obj.isin(['b', 'c'])

In [None]:
mask

In [None]:
obj[mask]
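The three methods from this section can be tried together in one sketch. unique returns the distinct values in order of first appearance, value_counts returns how often each value occurs, and isin builds a boolean mask that can be used for filtering.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

uniques = list(obj.unique())
counts = obj.value_counts()
mask = obj.isin(['b', 'c'])
filtered = obj[mask]

print(uniques)
print(counts['c'], counts['d'])
print(filtered.index.tolist())
```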

In [None]:
np.random.seed(1)
df=pd.DataFrame(np.random.randint(5,10,(4,4)),index=list("abcd"),columns=list("xyzt"));df

In [None]:
df.value_counts()

In [None]:
df.x.value_counts()

In [None]:
df.apply(pd.Series.value_counts)

In [None]:
df.apply(pd.Series.value_counts).fillna(0)
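Because the frame above is random, the sketch below uses a tiny frame with fixed, made-up values to show what the per-column count produces. Each distinct value becomes a row label, and a value that never occurs in a column appears as NaN until fillna(0) replaces it. The method is passed as pd.Series.value_counts, which is an assumption made here to avoid the top-level pd.value_counts function that newer pandas versions deprecate.

```python
import pandas as pd

frame = pd.DataFrame({'x': [5, 5, 6], 'y': [6, 7, 7]})  # illustrative values

# Count the occurrences of each value separately for every column.
counts = frame.apply(pd.Series.value_counts).fillna(0)
print(counts)
```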