<a href="https://colab.research.google.com/github/aryadeo/pandas/blob/master/pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Basics of Pandas
![Pandas Logo](https://drive.google.com/uc?id=1fn6w_q1O1jvFWQB6JlDQVKa6KG7_BG5A)

Date: 12/11/2019

pandas has two primary data structures:
1. Series (for 1D data)
2. DataFrame (for 2D data)

Data processing is typically split into three stages:
1. Data munging and cleaning
2. Analyzing and modelling the data
3. Organizing the results for suitable plotting and visualization

In [1]:
!pip install pandas



In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

#Data Structure


##Series-1D 


A Series can be built from:


*   np array
*   dictionary
*   scalar

In [3]:
#series created with auto indexing
s_test_1=pd.Series(np.random.rand(10))
s_test_1

0    0.171077
1    0.644541
2    0.731960
3    0.394671
4    0.951099
5    0.167603
6    0.736815
7    0.398214
8    0.789251
9    0.565660
dtype: float64

In [4]:
#series from numpy with defined indices
s_test_2=pd.Series(np.random.randn(10), index=list('ABCDEFGHIJ'))
s_test_2

A   -2.914286
B    0.742177
C    0.106850
D    0.043703
E    0.440833
F   -1.093644
G   -0.984647
H   -0.580268
I   -1.827252
J    0.197933
dtype: float64

In [5]:
#series from scalar with defined indices
s_test_3=pd.Series(55.55,index=[m for m in range(5)])
s_test_3

0    55.55
1    55.55
2    55.55
3    55.55
4    55.55
dtype: float64

In [6]:
#Series in dict form
s_test_4=pd.Series({'a':1,'b':2})
s_test_4

a    1
b    2
dtype: int64
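
When a Series is built from a dictionary, an explicit `index` can also be passed. The following is a minimal sketch with made-up data (`prices` and `s_from_dict` are illustrative names, not part of the notebook): labels are looked up in the dictionary, and a label with no matching key becomes NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not taken from the cells above.
prices = {'a': 1, 'b': 2}

# Each index label is looked up in the dict. 'c' has no key,
# so its value is NaN and the dtype is upcast from int64 to float64.
s_from_dict = pd.Series(prices, index=['b', 'c', 'a'])
print(s_from_dict)
```

The order of the result follows the `index` argument, not the order of the dictionary keys.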

Operations on a Series


In [7]:
print(s_test_1.index)
print(s_test_2.index)
print(s_test_3.index)
#an auto-generated index is shown as a RangeIndex, while explicitly defined indices are shown with their labels and dtype.

RangeIndex(start=0, stop=10, step=1)
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], dtype='object')
Int64Index([0, 1, 2, 3, 4], dtype='int64')


In [8]:
#a series behaves much like a NumPy ndarray: it supports indexing, slicing and boolean masks
print(s_test_1[0])
print(s_test_1[2:])
print(s_test_1[:5])
print(s_test_1[:])
print(s_test_1[s_test_1 > s_test_1.mean()])

0.1710769500278454
2    0.731960
3    0.394671
4    0.951099
5    0.167603
6    0.736815
7    0.398214
8    0.789251
9    0.565660
dtype: float64
0    0.171077
1    0.644541
2    0.731960
3    0.394671
4    0.951099
dtype: float64
0    0.171077
1    0.644541
2    0.731960
3    0.394671
4    0.951099
5    0.167603
6    0.736815
7    0.398214
8    0.789251
9    0.565660
dtype: float64
1    0.644541
2    0.731960
4    0.951099
6    0.736815
8    0.789251
9    0.565660
dtype: float64
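
Plain `[]` indexing works above because the labels of `s_test_1` happen to be 0 to 9. When integer labels differ from positions, it may be clearer to state the intent with `.loc` (by label) or `.iloc` (by position). A small sketch, assuming a made-up series `s_demo`:

```python
import pandas as pd

# Hypothetical series whose integer labels are not 0, 1, 2.
s_demo = pd.Series([1.5, 2.5, 3.5], index=[10, 20, 30])

by_label = s_demo.loc[10]      # label-based: the row whose label is 10
by_position = s_demo.iloc[0]   # position-based: the first row
tail = s_demo.iloc[1:]         # positional slice, end-exclusive like NumPy
above_mean = s_demo.loc[s_demo > s_demo.mean()]  # boolean mask

print(by_label, by_position)
print(tail)
print(above_mean)
```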


In [9]:
#to know the datatype of a series
s_test_1.dtype

dtype('float64')

In [10]:
#converting a series to a NumPy array; to_numpy() is the recommended approach whenever an actual ndarray is needed
s_test_1.to_numpy()

array([0.17107695, 0.64454101, 0.73195964, 0.39467148, 0.95109922,
       0.16760268, 0.73681526, 0.39821441, 0.78925079, 0.56566002])

In [11]:
#alternative way to get the array backing the series
s_test_1.array

<PandasArray>
[ 0.1710769500278454,  0.6445410146820336,   0.731959638176053,
 0.39467147821850057,  0.9510992239308719, 0.16760267995197298,
  0.7368152560641101,  0.3982144098322967,  0.7892507934617525,
  0.5656600164773486]
Length: 10, dtype: float64

In [12]:
#element-wise addition; values are matched by index label
s_test_1+s_test_1

0    0.342154
1    1.289082
2    1.463919
3    0.789343
4    1.902198
5    0.335205
6    1.473631
7    0.796429
8    1.578502
9    1.131320
dtype: float64
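
Adding a series to itself hides one important detail: arithmetic between two series aligns on index labels, not on positions. A minimal sketch with a made-up series `s_align` shows what happens when the labels only partly overlap:

```python
import numpy as np
import pandas as pd

s_align = pd.Series([1.0, 2.0, 3.0, 4.0], index=list('abcd'))

# Left operand has labels b, c, d; right operand has labels a, b, c.
# The result covers the union of labels, and labels missing on
# either side produce NaN.
total = s_align.iloc[1:] + s_align.iloc[:-1]
print(total)
```

Only `b` and `c` exist in both operands, so only those two rows hold numbers.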

In [13]:
#NumPy functions can be applied directly to a series
np.exp(s_test_1)

0    1.186582
1    1.905112
2    2.079151
3    1.483897
4    2.588553
5    1.182467
6    2.089271
7    1.489163
8    2.201746
9    1.760609
dtype: float64
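
The result of a NumPy function applied to a series is again a series with the same index. A short sketch with made-up values (`s_vals` is an illustrative name):

```python
import numpy as np
import pandas as pd

s_vals = pd.Series([1.0, 4.0, 9.0], index=['x', 'y', 'z'])

roots = np.sqrt(s_vals)   # a Series, not a bare ndarray
print(type(roots).__name__)
print(roots)
```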

In [14]:
#dictionary-like operations on a series: access by label and membership test with `in`
print(s_test_2['B'])
print(s_test_2['C'])
print('K' in s_test_2)
print('F' in s_test_2)
print(s_test_2['D'])

0.742177236547588
0.10685020051376777
False
True
0.04370261134018886
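
Accessing a label that is not in the index with `[]` raises a `KeyError`, just as with a dictionary. The `get` method avoids this. A minimal sketch with a made-up series `s_lookup`:

```python
import numpy as np
import pandas as pd

s_lookup = pd.Series([0.5, 1.5], index=['B', 'C'])

present = s_lookup.get('B')            # value stored under 'B'
missing = s_lookup.get('K')            # None instead of a KeyError
fallback = s_lookup.get('K', np.nan)   # default chosen by the caller

print(present, missing, fallback)
```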


In [0]:
#giving a name to the series
s_test_1.name='XYZ'

In [16]:
print(s_test_1.name)

XYZ


In [0]:
#rename returns a new series with the given name; s_test_1 itself keeps the name 'XYZ'
s_test_1_new=s_test_1.rename('ABC')

In [18]:
s_test_1_new

0    0.171077
1    0.644541
2    0.731960
3    0.394671
4    0.951099
5    0.167603
6    0.736815
7    0.398214
8    0.789251
9    0.565660
Name: ABC, dtype: float64

In [19]:
s_test_1

0    0.171077
1    0.644541
2    0.731960
3    0.394671
4    0.951099
5    0.167603
6    0.736815
7    0.398214
8    0.789251
9    0.565660
Name: XYZ, dtype: float64

##Data Frame

A DataFrame is a 2D data structure whose columns can have the same or different datatypes.
It can be built from a dictionary, an array of numbers, a Series, or another DataFrame.

###From dict of series

In [20]:
#the dataframe uses the union of the series indices and fills NaN where a series has no value for a label.
test_data={'one': pd.Series([10.,9,3,2],index=['a','b','c','d']),
           'two':pd.Series(np.random.rand(10),index=[m for m in range(10)])}
df_test=pd.DataFrame(test_data)
print(df_test)
#here you can see that for the a,b,c,d indices the second column is NaN, and for the remaining indices the first column is NaN.

    one       two
a  10.0       NaN
b   9.0       NaN
c   3.0       NaN
d   2.0       NaN
0   NaN  0.718794
1   NaN  0.390537
2   NaN  0.443889
3   NaN  0.962315
4   NaN  0.239685
5   NaN  0.745077
6   NaN  0.514985
7   NaN  0.437120
8   NaN  0.365648
9   NaN  0.998187


In [21]:
test_data_1={'A': pd.Series(np.random.rand(10),index=[m for m in range(10)]),
             'B': pd.Series([1,2,3,4,5,6,7,8,9,10])}
df_test_1=pd.DataFrame(test_data_1)
print(df_test_1)

          A   B
0  0.146206   1
1  0.234788   2
2  0.061640   3
3  0.825357   4
4  0.999384   5
5  0.031030   6
6  0.086095   7
7  0.486120   8
8  0.767250   9
9  0.784934  10


###From dict of arrays or lists

In [22]:
test_data_2={'column_1':[11.,22.,33.,44.,55.],
             'column_2':[1.,2.,3.,4.,5.]}
df_test_2=pd.DataFrame(test_data_2)
print(df_test_2)

   column_1  column_2
0      11.0       1.0
1      22.0       2.0
2      33.0       3.0
3      44.0       4.0
4      55.0       5.0


In [23]:
#df_test_2 already has a default integer index (0 to 4).
pd.DataFrame(df_test_2,index=['a','b','c','d','e'])
#Passing an existing dataframe together with index= does not relabel the rows; it looks up the given labels in df_test_2. None of them exist there, so every value is NaN.

   column_1  column_2
a       NaN       NaN
b       NaN       NaN
c       NaN       NaN
d       NaN       NaN
e       NaN       NaN
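
To make the difference between looking up labels and replacing labels explicit, here is a minimal sketch on a made-up frame `df_demo` (an illustrative stand-in, not the notebook's `df_test_2`):

```python
import pandas as pd

df_demo = pd.DataFrame({'column_1': [11., 22., 33.],
                        'column_2': [1., 2., 3.]})

# Reindexing looks the labels up in the existing index (0, 1, 2).
# None of 'a', 'b', 'c' exist there, so every value is NaN.
reindexed = df_demo.reindex(['a', 'b', 'c'])

# Relabelling keeps the values and only replaces the row labels.
relabelled = df_demo.copy()
relabelled.index = ['a', 'b', 'c']

print(reindexed)
print(relabelled)
```

Working on a copy leaves `df_demo` with its original integer index.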


In [24]:
#Now let's pass index= together with raw data (a dict of lists) instead of an existing dataframe.
d = {'one': [1., 2., 3., 4.],
     'two': [4, 30, 2, 1]}
pd.DataFrame(d)

   one  two
0  1.0    4
1  2.0   30
2  3.0    2
3  4.0    1


In [25]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
#d is a plain dict of lists with no index of its own, so the given labels are simply assigned to the rows instead of being looked up.

   one  two
a  1.0    4
b  2.0   30
c  3.0    2
d  4.0    1


###From list of dicts

In [26]:
test_list=[{'one':1,'two':2,'three':3},{'a':4,'b':5,'one':10}]
pd.DataFrame(test_list)

   one  two  three    a    b
0    1  2.0    3.0  NaN  NaN
1   10  NaN    NaN  4.0  5.0


###Other operations

In [27]:
#adding an index to the dataframe
df_test_3=pd.DataFrame(test_list,index=['first set','second set'])
print(df_test_3)

            one  two  three    a    b
first set     1  2.0    3.0  NaN  NaN
second set   10  NaN    NaN  4.0  5.0


In [28]:
#keeping only particular columns while building the dataframe
pd.DataFrame(test_list,columns=['one','two'])

   one  two
0    1  2.0
1   10  NaN
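
The `columns` argument both selects and orders the columns. As a small sketch with made-up records (`records` and the column name `'missing'` are illustrative), a requested name that appears in none of the dicts yields a column of NaN, and keys that are not requested are dropped:

```python
import pandas as pd

# Hypothetical records, not the notebook's test_list.
records = [{'one': 1, 'two': 2}, {'one': 10, 'b': 5}]

df_cols = pd.DataFrame(records, columns=['two', 'one', 'missing'])
print(df_cols)
```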


In [29]:
#selecting the first row (by slicing) and then the column 'one'
print(df_test_3[:1]['one'])

first set    1
Name: one, dtype: int64
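
The selection above is done in two steps (rows first, then the column). The same can usually be written as one `.loc` call, which is also the safer form when assigning values. A minimal sketch on a made-up frame `df_sel`:

```python
import pandas as pd

df_sel = pd.DataFrame([{'one': 1, 'two': 2}, {'one': 10, 'two': 20}],
                      index=['first set', 'second set'])

cell = df_sel.loc['first set', 'one']       # a single scalar value
sub = df_sel.loc[['first set'], ['one']]    # a 1x1 DataFrame

# Assigning through a single .loc call updates df_sel itself.
df_sel.loc['first set', 'one'] = 100
print(cell, sub.shape)
print(df_sel)
```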


In [30]:
#deleting a column in place with del
del df_test_3['one']
print(df_test_3)

            two  three    a    b
first set   2.0    3.0  NaN  NaN
second set  NaN    NaN  4.0  5.0


In [31]:
#pop removes the column in place and returns it as a series
df_test_3.pop('b')
print(df_test_3)

            two  three    a
first set   2.0    3.0  NaN
second set  NaN    NaN  4.0
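
Both `del` and `pop` change the dataframe in place. A short sketch on a made-up frame (`df_cols_demo` is an illustrative name) shows that `pop` hands back the removed column, while `drop` returns a new dataframe and leaves the original untouched:

```python
import pandas as pd

df_cols_demo = pd.DataFrame({'two': [2.0, 3.0],
                             'a': [4.0, 5.0],
                             'b': [6.0, 7.0]})

popped = df_cols_demo.pop('b')                # removed in place, returned as a Series
without_a = df_cols_demo.drop(columns=['a'])  # new DataFrame; df_cols_demo keeps 'a'

print(popped)
print(df_cols_demo.columns.tolist())
print(without_a.columns.tolist())
```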


In [32]:
#When inserting a scalar value, it will naturally be propagated to fill the column:
df_test_3['new_column']='any scalar value'
print(df_test_3)

            two  three    a        new_column
first set   2.0    3.0  NaN  any scalar value
second set  NaN    NaN  4.0  any scalar value


By default, a new column is appended at the end of the dataframe. With the 'insert' method, a column can be placed at any position in the dataframe.

In [33]:
df_test_3['n_c_1']=df_test_3['two']
print(df_test_3)
#here we are adding a new column to the existing dataframe with values copied from the 'two' column. 

            two  three    a        new_column  n_c_1
first set   2.0    3.0  NaN  any scalar value    2.0
second set  NaN    NaN  4.0  any scalar value    NaN


In [34]:
df_test_3.insert(3,'inserted_column',[300,400])
print(df_test_3)

            two  three    a  inserted_column        new_column  n_c_1
first set   2.0    3.0  NaN              300  any scalar value    2.0
second set  NaN    NaN  4.0              400  any scalar value    NaN
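
The first argument of `insert` is the zero-based position of the new column. A minimal sketch on a made-up frame `df_ins` covers the two edge positions and what happens when the column name already exists:

```python
import pandas as pd

df_ins = pd.DataFrame({'two': [2.0, 3.0], 'three': [3.0, 4.0]})

df_ins.insert(0, 'first', [1, 1])                   # becomes the first column
df_ins.insert(len(df_ins.columns), 'last', [9, 9])  # same position as a plain assignment

try:
    df_ins.insert(1, 'two', [0, 0])                 # this name is already taken
except ValueError as err:
    print('insert refused:', err)

print(df_ins.columns.tolist())
```

Unlike `drop`, `insert` modifies the dataframe in place and returns nothing.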
