## Pandas

- Pandas is a popular Python library for data manipulation and analysis.
- It provides easy-to-use, efficient data structures.
- These data structures hold numeric or labeled data, stored in tabular form.
- They are built on top of NumPy arrays, which makes many operations fast.


#### Features:

- Data Wrangling
- Modeling
- Visualization
- High Performance Computing
- Big Data
- Statistical Computing
- Numerical Computing
- Data Mining
- Text Processing
- Computing Language
- Educational Outreach
- Computational thinking

The three fundamental data structures in pandas are:<br>
<b>1) Series:</b> A 1-D labeled array, similar to an array, a list, or a column in a table. It holds the values of a single variable captured from multiple observations.<br>
- It holds homogeneous data and has an immutable size.<br>
- If data is an ndarray, the index passed must be the same length. If no index is passed, the default index is range(n), where n is the array length, i.e., [0, 1, 2, ..., len(array) - 1].
- <b>Syntax:</b> pandas.Series(data, index, dtype, copy)
 - data: takes various forms, such as an ndarray, a list, or a constant
 - index: index labels must be hashable and the same length as data; non-unique labels are allowed. Defaults to np.arange(n) if no index is passed.
 - dtype: the data type. If None, the data type is inferred.
 - copy: whether to copy the input data. Default False.
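
The index rules above can be checked with a short sketch. The variable names are illustrative and do not come from the original notebook.

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30])

# No index passed: pandas assigns the default positions 0..n-1.
default_indexed = pd.Series(values)

# Explicit index and dtype: the index needs one label per value.
labeled = pd.Series(values, index=['x', 'y', 'z'], dtype='float64')

# An index of the wrong length is rejected with a ValueError.
try:
    pd.Series(values, index=['x', 'y'])
    length_error = None
except ValueError as exc:
    length_error = exc
```

Catching the exception here is only a way to show the length check without stopping the notebook.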

<b>2) DataFrame:</b> A 2-D labeled table, equivalent to two or more Series joined together on a shared index.<br>
- It holds heterogeneous data (each column can have its own data type) and has a mutable size.<br>
- <b>Syntax:</b> pandas.DataFrame(data, index, columns, dtype, copy)
 - data: takes various forms, such as an ndarray, Series, map, list, dict, constant, or another DataFrame
 - index: row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
 - columns: column labels. Optional; defaults to np.arange(n) if no column labels are passed.
 - dtype: data type to force on the columns. If None, it is inferred.
 - copy: whether to copy the input data. The default is False.<br>
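
As a minimal sketch of the constructor arguments listed above (the names `grid`, `unlabeled`, `labeled`, and `mixed` are made up for illustration):

```python
import numpy as np
import pandas as pd

grid = np.arange(6).reshape(3, 2)

# No labels passed: rows and columns both default to 0..n-1.
unlabeled = pd.DataFrame(grid)

# Explicit row labels, column labels, and dtype.
labeled = pd.DataFrame(grid, index=['r1', 'r2', 'r3'],
                       columns=['left', 'right'], dtype='float64')

# Each column keeps its own dtype, so a frame can be heterogeneous.
mixed = pd.DataFrame({'name': ['Alex', 'Bob'], 'age': [10, 12]})
```

Inspecting `mixed.dtypes` shows one entry per column, which is what "heterogeneous" means in practice.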
 
<b>3) Panel:</b> A 3-D array. A panel is hard to draw, but it can be pictured as a container of DataFrames. Panel was deprecated and later removed from pandas, so the syntax below only works in old releases.<br>
- <b>Syntax:</b> pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)
- It holds heterogeneous data and has a mutable size.<br>
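
Because `pandas.Panel` is not available in recent pandas releases, 3-D data is often stored as a DataFrame with a two-level row index instead. The sketch below is one possible substitute and an assumption on our part, not something the original notebook uses; the `item`/`row` level names and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-D data: 2 items x 3 rows x 2 columns.
cube = np.arange(12).reshape(2, 3, 2)

# Build one DataFrame per item, then stack them under an outer 'item' level.
frames = {f'item{i}': pd.DataFrame(cube[i], columns=['x', 'y'])
          for i in range(cube.shape[0])}
stacked = pd.concat(frames, names=['item', 'row'])

# Selecting one item with .loc gives back an ordinary 2-D DataFrame.
first_item = stacked.loc['item0']
```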

In [2]:
import numpy as np
import pandas as pd

## Object creation

In [3]:
# create a Series with an arbitrary list
s = pd.Series([1, 4, np.nan, 6, 8])
s

0    1.0
1    4.0
2    NaN
3    6.0
4    8.0
dtype: float64

In [4]:
#you can specify an index to use when creating the Series
s = pd.Series([1, 4, np.nan, 6, 8],index=['A', 'Z', 'C', 'Y', 'E'])
s

A    1.0
Z    4.0
C    NaN
Y    6.0
E    8.0
dtype: float64

In [13]:
#Series with numpy array
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
s

100    a
101    b
102    c
103    d
dtype: object

In [7]:
#with dictionary
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
s

a    0.0
b    1.0
c    2.0
dtype: float64

In [5]:
#If data is a scalar value
s = pd.Series(5, index=[0, 1, 2, 3])
s

0    5
1    5
2    5
3    5
dtype: int64

In [6]:
# .array returns the actual data backing the Series
s.array

<PandasArray>
[5, 5, 5, 5]
Length: 4, dtype: int64

In [9]:
#Data frame using Dictionary
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])
football

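|  | year | team | wins | losses |
| --- | --- | --- | --- | --- |
| 0 | 2010 | Bears | 11 | 5 |
| 1 | 2011 | Bears | 8 | 8 |
| 2 | 2012 | Bears | 10 | 6 |
| 3 | 2011 | Packers | 15 | 1 |
| 4 | 2012 | Packers | 11 | 5 |
| 5 | 2010 | Lions | 6 | 10 |
| 6 | 2011 | Lions | 10 | 6 |
| 7 | 2012 | Lions | 4 | 12 |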


In [12]:
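# from a list of lists
data = [['Alex',10],['Bob',12],['Clarke',13]]
# cast only the numeric column; newer pandas releases reject dtype=float for the string column
df = pd.DataFrame(data,columns=['Name','Age']).astype({'Age': float})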
df

|  | Name | Age |
| --- | --- | --- |
| 0 | Alex | 10.0 |
| 1 | Bob | 12.0 |
| 2 | Clarke | 13.0 |


In [16]:
#With index
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
df

|  | a | b | c |
| --- | --- | --- | --- |
| first | 1 | 2 | NaN |
| second | 5 | 10 | 20.0 |


In [25]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2013-01-01 | -0.76515 | -0.465253 | -1.329309 | -0.94424 |
| 2013-01-02 | 0.24513 | -0.880565 | -0.040945 | 0.041223 |
| 2013-01-03 | 1.226936 | 0.982413 | 0.279933 | -1.568652 |
| 2013-01-04 | 1.047223 | -1.185454 | -0.586562 | 0.429533 |
| 2013-01-05 | -0.195696 | -1.198463 | -0.515487 | -1.355832 |
| 2013-01-06 | -2.516546 | -1.898245 | 0.522398 | -0.522675 |


In [42]:
#Create a DataFrame from Dict of Series
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
df

|  | one | two |
| --- | --- | --- |
| a | 1.0 | 1 |
| b | 2.0 | 2 |
| c | 3.0 | 3 |
| d | NaN | 4 |


## Selection

Use these commands to select a specific subset of your data.
- df[col] -- Returns column with label col as Series
- df[[col1, col2]] -- Returns columns as a new DataFrame
- s.iloc[0] -- Selection by position
- s.loc['index_one'] -- Selection by index label
- df.iloc[0,:] -- First row
- df.iloc[0,0] -- First element of first column
- df.ix[row, column] -- Mixed label- and integer-based indexing. It was deprecated and then removed from pandas, so use loc or iloc instead.
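
The selection commands listed above can be tried on a small frame. This is a minimal sketch with made-up data, not a cell from the original notebook.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]},
                  index=['a', 'b', 'c'])

col_series = df['one']            # single label -> Series
col_frame = df[['one', 'two']]    # list of labels -> DataFrame
first_row = df.iloc[0, :]         # first row, by position
first_cell = df.iloc[0, 0]        # first element of the first column

s = df['two']
by_position = s.iloc[0]           # positional lookup on a Series
by_label = s.loc['c']             # label lookup on a Series
```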

-> loc gets rows (or columns) with particular labels from the index.<br>
s.loc[indexer]<br>
df.loc[row_indexer,column_indexer]<br>
-> iloc gets rows (or columns) at particular positions in the index (so it only takes integers).<br>
-> ix tried to behave like loc but fell back to iloc behavior when a label was not present in the index. It has been removed from pandas; use loc or iloc explicitly.<br>
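
The difference between label-based and position-based selection is easiest to see when the index labels are integers that do not match the positions. The following sketch uses hypothetical data chosen for that purpose.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2.
s = pd.Series(['x', 'y', 'z'], index=[2, 0, 1])

by_label = s.loc[0]        # the row whose label is 0
by_position = s.iloc[0]    # the row in position 0

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]},
                  index=['a', 'b', 'c'])

label_slice = df.loc['a':'b', 'one']   # label slices include the end label
position_slice = df.iloc[0:1, 0]       # position slices exclude the end
```

Here `by_label` and `by_position` pick different rows, and the two slices differ in length even though both start at the first row.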

In [12]:
#Create a DataFrame from Dict of Series
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df

#Column Selection
df['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [13]:
#Column Addition
df['three']=pd.Series([10,20,30],index=['a','b','c'])  # add a new column from a Series
df['four']=df['one']+df['three']  # add a column computed from existing columns
df

|  | one | two | three | four |
| --- | --- | --- | --- | --- |
| a | 1.0 | 1 | 10.0 | 11.0 |
| b | 2.0 | 2 | 20.0 | 22.0 |
| c | 3.0 | 3 | 30.0 | 33.0 |
| d | NaN | 4 | NaN | NaN |


In [14]:
#Column Deletion
del df['one']  # delete with the del statement
df.pop('two')  # delete with pop, which also returns the removed column
df

|  | three | four |
| --- | --- | --- |
| a | 10.0 | 11.0 |
| b | 20.0 | 22.0 |
| c | 30.0 | 33.0 |
| d | NaN | NaN |


In [47]:
#Row Selection by Label
df.loc['b']

three    20.0
four     22.0
Name: b, dtype: float64

In [48]:
#Row Selection by position(integer) location
df.iloc[2]

three    30.0
four     33.0
Name: c, dtype: float64

In [49]:
#Row Slice
df[2:4]

|  | three | four |
| --- | --- | --- |
| c | 30.0 | 33.0 |
| d | NaN | NaN |


In [51]:
#Addition of Rows
data = [['Alex',10],['Bob',12],['Clarke',13]]
df2 = pd.DataFrame(data,columns=['Name','Age']).astype({'Age': float})
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df, df2])

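|  | three | four | Name | Age |
| --- | --- | --- | --- | --- |
| a | 10.0 | 11.0 | NaN | NaN |
| b | 20.0 | 22.0 | NaN | NaN |
| c | 30.0 | 33.0 | NaN | NaN |
| d | NaN | NaN | NaN | NaN |
| 0 | NaN | NaN | Alex | 10.0 |
| 1 | NaN | NaN | Bob | 12.0 |
| 2 | NaN | NaN | Clarke | 13.0 |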


In [54]:
#Deletion of Rows with index
df.drop('a')

|  | three | four |
| --- | --- | --- |
| b | 20.0 | 22.0 |
| c | 30.0 | 33.0 |
| d | NaN | NaN |
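
Note that `drop` returns a new DataFrame and leaves the original untouched, which is why `df` still contains row `a` in the later cells. A minimal sketch of this behavior, using a small stand-in frame rather than the notebook's `df`:

```python
import pandas as pd

df = pd.DataFrame({'three': [10.0, 20.0, 30.0], 'four': [11.0, 22.0, 33.0]},
                  index=['a', 'b', 'c'])

without_a = df.drop('a')                  # drop a row by index label
without_four = df.drop(columns='four')    # drop a column by name

# df still has all rows and columns; rebind the name to keep a result.
rows_before = len(df)
df = df.drop(['a', 'b'])
```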


## Viewing/Inspecting Data

- df.head(n) -- First n rows of the DataFrame<br>
<b>Syntax</b> : DataFrame.head(n)
- df.tail(n) --Last n rows of the DataFrame<br>
<b>Syntax</b> : DataFrame.tail(n)
- df.shape -- Number of rows and columns<br>
<b>Syntax</b> : DataFrame.shape<br>
- df.info() -- Index, Datatype and Memory information<br>
<b>Syntax</b> : DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None)
- df.describe() -- Summary statistics for numerical columns<br>
<b>Syntax</b> : DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
- s.value_counts(dropna=False) -- View unique values and counts<br>
<b>Syntax</b> : Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
-  df.apply(pd.Series.value_counts) -- Unique values and counts for all columns<br>
<b>Syntax</b> :  DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwds)
- df.to_numpy() -- converting dataframe to numpy array <br>
<b>Syntax</b> :DataFrame.to_numpy(dtype=None, copy=False, na_value=<object object>)
- df.sort_index()-- sorting by index<br>
<b>Syntax</b> : DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)
- df.sort_values()-- sorting by values<br>
<b>Syntax</b> : DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
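
Of the commands above, `df.apply(pd.Series.value_counts)` is the least obvious, so here is a small sketch on made-up data, together with `shape`, `head`, and `tail`.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2], 'y': [2, 2, 2]})

# Count how often each value occurs in every column.
# A value that never occurs in a column shows up as NaN in that column.
counts = df.apply(pd.Series.value_counts)

n_rows, n_cols = df.shape
top_two = df.head(2)
last_one = df.tail(1)
```

In `counts`, the row labels are the distinct values and each column holds that value's frequency in the corresponding column of `df`.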

In [63]:
# head() returns the first 5 rows by default; this frame has only 4, so all are shown
df.head()

|  | three | four |
| --- | --- | --- |
| a | 10.0 | 11.0 |
| b | 20.0 | 22.0 |
| c | 30.0 | 33.0 |
| d | NaN | NaN |


In [64]:
#first 2
df.head(2)

|  | three | four |
| --- | --- | --- |
| a | 10.0 | 11.0 |
| b | 20.0 | 22.0 |


In [66]:
#last 3
df.tail(3)

|  | three | four |
| --- | --- | --- |
| b | 20.0 | 22.0 |
| c | 30.0 | 33.0 |
| d | NaN | NaN |


In [68]:
#rows and columns
df.shape

(4, 2)

In [72]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   three   3 non-null      float64
 1   four    3 non-null      float64
dtypes: float64(2)
memory usage: 256.0+ bytes


In [71]:
df.describe()

|  | three | four |
| --- | --- | --- |
| count | 3.0 | 3.0 |
| mean | 20.0 | 22.0 |
| std | 10.0 | 11.0 |
| min | 10.0 | 11.0 |
| 25% | 15.0 | 16.5 |
| 50% | 20.0 | 22.0 |
| 75% | 25.0 | 27.5 |
| max | 30.0 | 33.0 |


In [74]:
# NumPy representation of the underlying data; this is fast and needs no copy when all columns share one dtype, as they do here
df.to_numpy()  #or np.asarray(df)

array([[10., 11.],
       [20., 22.],
       [30., 33.],
       [nan, nan]])
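
When the columns do not share a dtype, `to_numpy()` has to find a single dtype that can hold everything. The sketch below contrasts the two cases; the frames are illustrative and not part of the original notebook.

```python
import pandas as pd

same_dtype = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
mixed_dtype = pd.DataFrame({'name': ['Alex', 'Bob'], 'age': [10, 12]})

# All columns are float64, so the resulting array is float64 too.
float_array = same_dtype.to_numpy()

# Text and integers have no common numeric dtype, so the result is a
# generic object array.
object_array = mixed_dtype.to_numpy()
```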

In [75]:
#transpose
df.T

|  | a | b | c | d |
| --- | --- | --- | --- | --- |
| three | 10.0 | 20.0 | 30.0 | NaN |
| four | 11.0 | 22.0 | 33.0 | NaN |


In [82]:
#Sorting data by an axis
df.sort_index(axis=1, ascending=False)

|  | three | four |
| --- | --- | --- |
| a | 10.0 | 11.0 |
| b | 20.0 | 22.0 |
| c | 30.0 | 33.0 |
| d | NaN | NaN |


In [81]:
#Sorting by values
df.sort_values(by='four')

|  | three | four |
| --- | --- | --- |
| a | 10.0 | 11.0 |
| b | 20.0 | 22.0 |
| c | 30.0 | 33.0 |
| d | NaN | NaN |
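
The frame above is already in ascending order, so the sort changes nothing visible. A sketch with `ascending=False` and `na_position` makes the effect clearer; it rebuilds an equivalent frame so that it can run on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'three': [10.0, 20.0, 30.0, np.nan],
                   'four': [11.0, 22.0, 33.0, np.nan]},
                  index=['a', 'b', 'c', 'd'])

# Largest values first; missing values are placed last by default.
descending = df.sort_values(by='four', ascending=False)

# Same ordering, but with missing values moved to the top.
nan_first = df.sort_values(by='four', ascending=False, na_position='first')
```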


In [4]:
s = pd.Series([3, 1, 2, 3, 4, np.nan])
s.value_counts()

3.0    2
4.0    1
2.0    1
1.0    1
dtype: int64
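
The NaN in `s` is missing from the counts above because `dropna` defaults to True. A short sketch of the `dropna` and `normalize` options on the same data:

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 2, 3, 4, np.nan])

# Count missing values as their own bucket.
with_nan = s.value_counts(dropna=False)

# Relative frequencies instead of raw counts (NaN is still excluded).
shares = s.value_counts(normalize=True)
```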