NumPy Introduction:
-----------------------------------
NumPy is a Python package whose name stands for 'Numerical Python'. It is a library consisting of multidimensional array objects and a collection of routines for processing those arrays.

Numeric, the ancestor of NumPy, was developed by Jim Hugunin. Another package, Numarray, was also developed, with some additional functionality. In 2005, Travis Oliphant created the NumPy package by incorporating the features of Numarray into the Numeric package. There are many contributors to this open source project.
(Source: https://www.tutorialspoint.com/numpy/numpy_introduction.htm)

Using NumPy, a developer can perform the following operations −

-- Mathematical and logical operations on arrays.

-- Fourier transforms and routines for shape manipulation.

-- Operations related to linear algebra. NumPy has built-in functions for linear algebra and random number generation.
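
A minimal sketch of the kinds of operations listed above. The array values and variable names are made up for illustration, and print() is written with parentheses so the snippet also runs under Python 3.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])    # illustrative data

# Mathematical and logical operations work element by element
doubled = a * 2                        # [2. 4. 6. 8.]
mask = a > 2                           # [False False  True  True]

# Fourier transform, followed by a shape manipulation
spectrum = np.fft.fft(a)               # complex array of length 4
grid = a.reshape(2, 2)                 # [[1. 2.] [3. 4.]]

# Linear algebra: solve grid.dot(x) = [1, 1]
x = np.linalg.solve(grid, np.array([1.0, 1.0]))

print(doubled)
print(mask)
print(spectrum.shape)
print(x)                               # close to [-1.  1.]
```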


In [1]:
#importing NumPy
import numpy as np

In [None]:
#!pip install numpy
#conda install numpy

ndarray
---------------------
The most important object defined in NumPy is an N-dimensional array type called ndarray. It describes a collection of items of the same type. Items in the collection can be accessed using a zero-based index.

Every item in an ndarray takes up a block of memory of the same size. The type of the items is described by a data-type object (called dtype) attached to the array.

Any single item extracted from an ndarray object (by indexing) is represented by a Python object of one of the array scalar types. Slicing, in contrast, returns another ndarray.
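
The following sketch illustrates the points above: a shared dtype, a fixed item size, and the difference between indexing one item and taking a slice. The array contents are made up for illustration.

```python
import numpy as np

a = np.array([1, 2, 3], dtype=np.int32)   # every item is a 4-byte integer

print(a.dtype)        # int32
print(a.itemsize)     # 4 (bytes per item)
print(a.nbytes)       # 12 = 3 items * 4 bytes

item = a[0]           # indexing a single item gives an array scalar
print(type(item))     # numpy.int32, not the built-in int

part = a[0:2]         # slicing gives another ndarray
print(type(part))
```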

In [4]:
import numpy as np 
a = np.array([1,2,3,4,5]) 
print a
#a.shape-------------
type(a)
#len(a)
#a[1]

[1 2 3 4 5]


numpy.ndarray

In [4]:
a = np.array([[1, 2], [3, 4], [6, 7]]) 
print a

type(a)
#len(a)
#a.shape
a.size
#a[1:2,0:1]

[[1 2]
 [3 4]
 [6 7]]


6

In [16]:
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype = float)
print a

[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]


In [30]:
# ndarray.shape

import numpy as np 
a = np.array([[1,2,3],[4,5,6]]) 
print a.shape

(2L, 3L)


In [21]:
import numpy as np 

a = np.array([[1,2,3],[4,5,6]]) 
print a
a.shape

a.shape = (3,2) 
#a.shape
print a 

[[1 2 3]
 [4 5 6]]
[[1 2]
 [3 4]
 [5 6]]


In [39]:
#NumPy also provides a reshape function, which returns the same data with a new shape (the total number of elements must stay the same).
import numpy as np 
a = np.array([[1,2,3],[4,5,6]]) 
print a
b = a.reshape(3,2) 
print b

[[1 2 3]
 [4 5 6]]
[[1 2]
 [3 4]
 [5 6]]
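
A short sketch of two details of reshape that the cells above do not show: -1 as a placeholder dimension, and the fact that for a contiguous array like this one the result is a view that shares data with the original. The variable names are illustrative.

```python
import numpy as np

a = np.arange(6)            # [0 1 2 3 4 5]
b = a.reshape(2, 3)         # new array object; a keeps its own shape
c = a.reshape(3, -1)        # -1 lets NumPy work out the missing dimension

print(a.shape)
print(b.shape)
print(c.shape)

# b is a view on the data of a, so writing through b also changes a
b[0, 0] = 99
print(a[0])                 # 99
```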


In [72]:
# numpy.arange --> This function returns an ndarray object containing evenly spaced values within a given range.
#numpy.arange(start, stop, step, dtype)
import numpy as np 
x = np.arange(5) 
print x

[0 1 2 3 4]


In [73]:
import numpy as np 
# dtype set 
x = np.arange(5, dtype = float)
print x

[0. 1. 2. 3. 4.]


In [74]:
# start and stop parameters set 
import numpy as np 
x = np.arange(10,20,2) 
print x

[10 12 14 16 18]


In [75]:
#numpy.linspace --> This function is similar to arange() function. In this function, instead of step size, 
#the number of evenly spaced values between the interval is specified. 
#numpy.linspace(start, stop, num, endpoint, retstep, dtype)

import numpy as np 
x = np.linspace(10,20,5) 
print x

[10.  12.5 15.  17.5 20. ]


In [22]:
# endpoint set to false 
x = np.linspace(10,20, 5, endpoint = False) 
print x

[10. 12. 14. 16. 18.]
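
To make the contrast between arange and linspace concrete, here is a small sketch (values chosen for illustration). The retstep=True argument asks linspace to return the step it computed as well.

```python
import numpy as np

# arange: fix the step, the number of points follows (stop is excluded)
by_step = np.arange(0, 1, 0.25)

# linspace: fix the number of points, the step follows (stop is included)
by_count, step = np.linspace(0, 1, 5, retstep=True)

print(by_step)      # [0.   0.25 0.5  0.75]
print(by_count)     # [0.   0.25 0.5  0.75 1.  ]
print(step)         # 0.25
```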


In [26]:
#ndarray.ndim
#This array attribute returns the number of array dimensions.
import numpy as np 
a = np.arange(24) #numpy.arange() is used to return evenly spaced numbers
print a
a.ndim
#len(a)

#b=a.reshape(2,7,3)
b=a.reshape(2,3,4)
print b
b.ndim

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]


3

In [50]:
#ndarray.itemsize --> This array attribute returns the length of each element of array in bytes.
# dtype of array is int8 (1 byte) 
import numpy as np 
x = np.array([1,2,3,4,5], dtype = np.int8) 
print x.itemsize

1


In [51]:
# dtype of array is now float32 (4 bytes) 
import numpy as np 
x = np.array([1,2,3,4,5], dtype = np.float32) 
print x.itemsize

4


In [5]:
#Converting from one type to another
abc=np.array([[1.2, 2.2, 3.2],[4.1, 5.5, 6.1]])
abc
abc.dtype
abc.itemsize
#abc1 = abc.astype('float32')
#abc1.dtype
#abc1.itemsize

8

In [9]:
#numpy.empty  --> It creates an uninitialized array of specified shape and dtype. It uses the following constructor
# numpy.empty(shape, dtype = float, order = 'C')
import numpy as np 
x = np.empty([3,2], dtype = int) 
print x
#The elements show arbitrary values (whatever happened to be in memory), because the array is not initialized.

[[30933434        0]
 [55098504        0]
 [31401744        0]]


In [10]:
# numpy.zeros --> Returns a new array of specified size, filled with zeros.
#numpy.zeros(shape, dtype = float, order = 'C')
import numpy as np 
x = np.zeros(5) 
print x
#y = np.zeros((5,), dtype=int)
#print y

[0. 0. 0. 0. 0.]


In [11]:
# numpy.ones --> Returns a new array of specified size and type, filled with ones.
import numpy as np 
x = np.ones(5) 
print x

y = np.ones([2,2], dtype = int) 
print y

[1. 1. 1. 1. 1.]
[[1 1]
 [1 1]]


In [70]:
# numpy.asarray--> This function is similar to numpy.array, but it has fewer parameters and does not copy the input when it is already an ndarray of the requested dtype.
import numpy as np 

x = [1,2,3] 
type (x)
#a = np.asarray(x) 
#type (a)
#print a

list

In [71]:
import numpy as np 

x = [1,2,3]
a = np.asarray(x, dtype = float) 
print a

[1. 2. 3.]
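
The practical difference between numpy.array and numpy.asarray shows up when the input is already an ndarray. A minimal sketch, with made-up variable names:

```python
import numpy as np

original = np.array([1, 2, 3])

copied = np.array(original)      # np.array copies by default
same = np.asarray(original)      # np.asarray hands back the input itself

print(copied is original)        # False
print(same is original)          # True

same[0] = 100                    # the change is visible through `original`
print(original[0])               # 100
print(copied[0])                 # 1
```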


In [78]:
#Indexing, Slicing and Dicing
import numpy as np 
a = np.arange(10) 
print a
#A Python slice object is constructed by giving start, stop, and step parameters to the built-in slice function. 
#This slice object is passed to the array to extract a part of array.
#s = slice(2,7,2) 
#print a[s]

[0 1 2 3 4 5 6 7 8 9]


In [79]:
#The same result can also be obtained by giving the slicing parameters separated by a colon : (start:stop:step) 
#directly to the ndarray object.
import numpy as np 
a = np.arange(10) 
b = a[2:8:2] 
print b

[2 4 6]
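
The sketch below checks that a slice object and the colon notation select the same items, and shows that a basic slice is a view on the original array rather than a copy.

```python
import numpy as np

a = np.arange(10)

s = slice(2, 8, 2)            # same as writing 2:8:2
print(a[s])                   # [2 4 6]
print(a[2:8:2])               # [2 4 6]

view = a[2:8:2]               # a basic slice is a view, not a copy
view[0] = -1                  # writes into a as well
print(a[2])                   # -1
```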


In [82]:
# slice single item 
a = np.arange(10) 
print a
b = a[3] 
print b

[0 1 2 3 4 5 6 7 8 9]
3


In [84]:
# slice items starting from index 
a = np.arange(15) 
print a
print a[10:]

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
[10 11 12 13 14]


In [85]:
# slice items between indexes 
a = np.arange(20) 
print a[10:15]

[10 11 12 13 14]


In [88]:
a = np.array([[1,2,3],[3,4,5],[4,5,6]]) 
print a  

# slice items starting from index
#items in 2nd and 3rd rows
print a[1:]

[[1 2 3]
 [3 4 5]
 [4 5 6]]
[[3 4 5]
 [4 5 6]]


In [90]:
#Slice the array from 2nd column to 3rd column
a[:,1:3]

array([[2, 3],
       [4, 5],
       [5, 6]])

In [94]:
#Slice the array from 2nd column to 3rd and from 2nd row to 3rd row.
a[1:3, 1:3]

array([[4, 5],
       [5, 6]])

In [100]:
# Items in second column
a[:, 1]
#a[...,1]

array([2, 4, 5])

In [99]:
#items in second row
a[1,...]

array([3, 4, 5])

In [103]:
a[a>4]

array([5, 5, 6])

In [105]:
a = np.array([np.nan, 1,2,np.nan,3,4,5]) #nan--> not a number
a
#print a[~np.isnan(a)]

array([nan,  1.,  2., nan,  3.,  4.,  5.])
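
The commented-out line in the cell above filters out the NaN entries with a boolean mask. A sketch of that idea, with np.nanmean added as one way to aggregate while skipping NaN:

```python
import numpy as np

a = np.array([np.nan, 1, 2, np.nan, 3, 4, 5])

mask = np.isnan(a)          # True where the value is NaN
clean = a[~mask]            # ~ inverts the mask, keeping the real numbers

print(clean)                # [1. 2. 3. 4. 5.]
print(np.nanmean(a))        # 3.0, the mean of the non-NaN values
```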

In [12]:
a = np.array([1, 2+6j, 5, 3.5+5j]) 
print a[np.iscomplex(a)]

[2. +6.j 3.5+5.j]


In [104]:
# ndarray.flatten(order)
a = np.arange(8).reshape(2,4) 
print a
#print a.flatten() 
#print a.flatten(order = 'F') # flatten in column major way
#print a.flatten(order = 'C') # Flatten in row major way

[[0 1 2 3]
 [4 5 6 7]]


In [113]:
# numpy.transpose(arr, axes)
print a
np.transpose(a)

[[0 1 2 3]
 [4 5 6 7]]


array([[0, 4],
       [1, 5],
       [2, 6],
       [3, 7]])

In [106]:
# Working with Statistical Functions

a = np.array([[3,7,5,6],[8,4,3,9],[2,4,9,11]]) 
print a

[[ 3  7  5  6]
 [ 8  4  3  9]
 [ 2  4  9 11]]


In [115]:
np.max(a)

11

In [117]:
np.min(a)

2

In [118]:
np.min(a, axis = 1) #axis=1: minimum of each row (taken across the columns). axis=0: minimum of each column

array([3, 3, 2])

In [119]:
np.min(a, axis = 0) #along the columns

array([2, 4, 3, 6])

In [120]:
np.max(a, axis = 0) #along the columns

array([ 8,  7,  9, 11])
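
One way to remember the axis argument: the axis you name is the one that gets collapsed. The sketch reuses the array from the cells above and looks at the shape of each result.

```python
import numpy as np

a = np.array([[3, 7, 5, 6],
              [8, 4, 3, 9],
              [2, 4, 9, 11]])      # shape (3, 4)

col_min = a.min(axis=0)   # collapses the 3 rows    -> one value per column
row_min = a.min(axis=1)   # collapses the 4 columns -> one value per row

print(col_min)            # [2 4 3 6]
print(col_min.shape)      # (4,)
print(row_min)            # [3 3 2]
print(row_min.shape)      # (3,)
```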

In [122]:
#Finding minimum element in 2nd row
np.min(a[1,:]) 

3

In [107]:
print a

[[ 3  7  5  6]
 [ 8  4  3  9]
 [ 2  4  9 11]]


In [123]:
#numpy.median() --> Median is defined as the value separating the higher half of a data sample from the lower half.

np.median(a)

5.5

In [124]:
#Along the columns (Axis=0)
np.median(a, axis = 0) 

array([3., 4., 5., 9.])

In [125]:
#Along the rows (Axis=1)
np.median(a, axis = 1) 

array([5.5, 6. , 6.5])

In [126]:
# numpy.mean()
np.mean(a) 

5.916666666666667

In [128]:
print np.mean(a, axis = 0)
print np.mean(a, axis = 1)

[4.33333333 5.         5.66666667 8.66666667]
[5.25 6.   6.5 ]


In [134]:
a1= a.flatten()
print a1
a2=np.unique(a1)
print a2

[ 3  7  5  6  8  4  3  9  2  4  9 11]
[ 2  3  4  5  6  7  8  9 11]


In [135]:
#Standard deviation 
np.std(a2)

2.7666443551086073

In [136]:
#Variance
np.var(a2)

7.65432098765432

In [143]:
#percentile
# numpy.percentile(a, q, axis)
print np.percentile(a2, 50)
print np.median(a2)

6.0
6.0


In [147]:
#Arithmetic Operations
a = np.array([[1,2,3],[4,5,6],[7,8,9]])
print a
b=np.array([[1,2,3]])
print b

[[1 2 3]
 [4 5 6]
 [7 8 9]]
[[1 2 3]]


In [158]:
print np.add(a,b)
print np.subtract(a,b)
print np.multiply(a,b)
print np.divide(a,b)
print np.power(b,2)
print np.mod(10,6) # to get the remainder

[[ 2  4  6]
 [ 5  7  9]
 [ 8 10 12]]
[[0 0 0]
 [3 3 3]
 [6 6 6]]
[[ 1  4  9]
 [ 4 10 18]
 [ 7 16 27]]
[[1 1 1]
 [4 2 2]
 [7 4 3]]
[[1 4 9]]
4
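
Two things happen in the cell above that are worth spelling out. First, b has shape (1, 3) and is broadcast across the three rows of a. Second, the whole-number results of np.divide come from integer division under Python 2; under Python 3, np.divide on integer arrays returns floating-point quotients. The sketch below picks the kind of division explicitly.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])   # shape (3, 3)
b = np.array([[1, 2, 3]])                         # shape (1, 3)

# b is broadcast: its single row is reused for every row of a
total = a + b
print(total)

# Integer arrays: choose the kind of division explicitly
floor_q = np.floor_divide(a, b)     # whole-number quotients
true_q = np.true_divide(a, b)       # floating-point quotients
print(floor_q)
print(true_q)
```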


In [180]:
# Reading a file using Python
myFile = open("C://Users//user//Desktop//Sample File for testing.txt","r")
x = myFile.read()
myFile.close() # release the file handle once reading is done
print x

Introduction to Big Data; Data structures in python
Working with Numpy and Pandas: Data Exploration



In [182]:
# Writing in a file using Python
myFile = open("C://Users//user//Desktop//Sample File for testing.txt","w")
myFile.write("We have learnt a lot today")
myFile.close()

In [183]:
myFile = open("C://Users//user//Desktop//Sample File for testing.txt","a")
myFile.write("Python is easy")
myFile.close()
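
The same read, write and append steps can be written with the with statement, which closes the file automatically even if an error occurs. This is a sketch only: the file lives in a temporary directory and its name is hypothetical.

```python
import os
import tempfile

# Hypothetical file in a temporary directory, used only for this sketch
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

with open(path, "w") as f:          # "w" creates or overwrites the file
    f.write("We have learnt a lot today\n")

with open(path, "a") as f:          # "a" appends to the end of the file
    f.write("Python is easy\n")

with open(path, "r") as f:          # the file is closed when the block ends
    text = f.read()

print(text)
```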

Introducing Pandas
---------------------------------------
pandas is a software library written for the Python programming language for data manipulation and analysis. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.
In 2008, developer Wes McKinney started developing pandas because he needed a high-performance, flexible tool for data analysis.

Prior to pandas, Python was mostly used for data munging and preparation, and contributed very little to data analysis itself. Using pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of its origin: load, prepare, manipulate, model, and analyze.

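pandas deals with the following three data structures:
-- Series
-- DataFrame
-- Panel (deprecated in pandas 0.20 and removed in 0.25; current versions offer only Series and DataFrame)

These data structures are built on top of NumPy arrays, which is what makes them fast.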

Link: https://www.tutorialspoint.com/python_pandas/python_pandas_introduction_to_data_structures.htm

Pandas.Series(): https://www.tutorialspoint.com/python_pandas/python_pandas_series.htm
Pandas.Panel(): https://www.tutorialspoint.com/python_pandas/python_pandas_panel.htm

Pandas.DataFrame(): https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm

pandas DataFrame
-----------------------------------------

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

Features of DataFrame
-- Columns can be of different types
-- Size is mutable
-- Labeled axes (rows and columns)
-- Arithmetic operations can be performed on rows and columns

pandas.DataFrame( data, index, columns, dtype, copy)

A pandas DataFrame can be created using various inputs like −
-- Lists
-- dict
-- Series
-- Numpy ndarrays
-- Another DataFrame


In [None]:
!pip install pandas

In [None]:
conda install pandas

In [14]:
import pandas as pd

In [15]:
import pandas as pd
df = pd.DataFrame()
print df

Empty DataFrame
Columns: []
Index: []


In [3]:
data = [1,2,3,4,5,6,7,8,9,10]
print type(data)
df = pd.DataFrame(data)
print df

<type 'list'>
    0
0   1
1   2
2   3
3   4
4   5
5   6
6   7
7   8
8   9
9  10


In [16]:
data = [['Rohit', 13],['Rahul',5],['Kohli',31],['Dhawan', 10]]
print data
print type(data)
df = pd.DataFrame(data,columns=['Name','Centuries'])
print df
print df['Name']

[['Rohit', 13], ['Rahul', 5], ['Kohli', 31], ['Dhawan', 10]]
<type 'list'>
     Name  Centuries
0   Rohit         13
1   Rahul          5
2   Kohli         31
3  Dhawan         10
0     Rohit
1     Rahul
2     Kohli
3    Dhawan
Name: Name, dtype: object


In [17]:
data = {'Name':['Rohit', 'Rahul', 'Kohli', 'Dhawan'],'Centuries':[10,5,31,10],'Age':[32,36,31,34]}
print data
print type(data)
df = pd.DataFrame(data)
print df
df['Age']

{'Age': [32, 36, 31, 34], 'Name': ['Rohit', 'Rahul', 'Kohli', 'Dhawan'], 'Centuries': [10, 5, 31, 10]}
<type 'dict'>
   Age  Centuries    Name
0   32         10   Rohit
1   36          5   Rahul
2   31         31   Kohli
3   34         10  Dhawan


0    32
1    36
2    31
3    34
Name: Age, dtype: int64

In [18]:
data = {'Name':['Virat', 'Rohit', 'Rahane', 'Pujara'],'Age':[32,32,31,33]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print df

#print the data for players whose age>32
#df[]
print df['Age']>32
print df[df['Age']>32]

       Age    Name
rank1   32   Virat
rank2   32   Rohit
rank3   31  Rahane
rank4   33  Pujara
rank1    False
rank2    False
rank3    False
rank4     True
Name: Age, dtype: bool
       Age    Name
rank4   33  Pujara


In [12]:
#Creating dataFrame from list of Dictionary

data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

df = pd.DataFrame(data)
#print df

#df1 = pd.DataFrame(data, columns=['a', 'b'])
#df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
#print df1

#df1 = pd.DataFrame(data, columns=['a', 'b1'])
#df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
print df

   a   b     c
0  1   2   NaN
1  5  10  20.0


In [13]:
Ser = pd.Series(range(9))
print Ser
type(Ser)
df = pd.DataFrame(Ser)
print df

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
dtype: int64
   0
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8


In [14]:
#Creating a DataFrame from Dict of Series
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']), 
   'three' : pd.Series([10,20,30], index=['a','b','c'])}
print d

{'three': a    10
b    20
c    30
dtype: int64, 'two': a    1
b    2
c    3
d    4
dtype: int64, 'one': a    1
b    2
c    3
dtype: int64}


In [15]:
df = pd.DataFrame(d)
print ("DataFrame")
print df

DataFrame
   one  three  two
a  1.0   10.0    1
b  2.0   20.0    2
c  3.0   30.0    3
d  NaN    NaN    4


In [16]:
#Accessing a column
print df['one']

# Adding columns
print ("")
print ("Column Addition")
df['four']=pd.Series([20,30,40,70],index=['a','b','c','d'])
print df
df['five']=df['three']+df['four']
print df


a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

Column Addition
   one  three  two  four
a  1.0   10.0    1    20
b  2.0   20.0    2    30
c  3.0   30.0    3    40
d  NaN    NaN    4    70
   one  three  two  four  five
a  1.0   10.0    1    20  30.0
b  2.0   20.0    2    30  50.0
c  3.0   30.0    3    40  70.0
d  NaN    NaN    4    70   NaN


In [31]:
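# Deleting a column (done on a copy, so that df keeps all its columns for the cells below)
print ("deleting a column")
df_del = df.copy()
del df_del['one']     # del removes the column in place
print df_del

df_del.pop('two')     # pop also removes the column in place, and returns it as a Series
print df_del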


deleting a column
   three  two  four  five
a   10.0    1    20  30.0
b   20.0    2    30  50.0
c   30.0    3    40  70.0
d    NaN    4    70   NaN
   three  four  five
a   10.0    20  30.0
b   20.0    30  50.0
c   30.0    40  70.0
d    NaN    70   NaN


In [18]:
df

   one  three  two  four  five
a  1.0   10.0    1    20  30.0
b  2.0   20.0    2    30  50.0
c  3.0   30.0    3    40  70.0
d  NaN    NaN    4    70   NaN


In [19]:
df.isna().sum()

one      1
three    1
two      0
four     0
five     1
dtype: int64

In [None]:
#Data issues: missing data, duplicate values
# If only a few values are missing, delete the rows that contain them
# If many values are missing, impute (fill in) those cells with suitable values instead
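
A small sketch of the two strategies named above, dropping incomplete rows versus imputing with the column mean. The frame df_demo and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in each column
df_demo = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                        'y': [10.0, 20.0, np.nan]})

missing = df_demo.isna().sum()              # missing values per column
print(missing)

dropped = df_demo.dropna(axis=0)            # keep only the complete rows
filled = df_demo.fillna(df_demo.mean())     # fill each column with its own mean

print(dropped)
print(filled)
```

Both calls return new DataFrames and leave df_demo unchanged, which makes it easy to compare the two results side by side.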

In [37]:
df_complete = df.dropna(axis=0) #axis=0 drops rows that contain NaN; axis=1 would drop columns. df itself is left unchanged

In [47]:
df

   one  three  two  four  five
a  1.0   10.0    1    20  30.0
b  2.0   20.0    2    30  50.0
c  3.0   30.0    3    40  70.0
d  NaN    NaN    4    70   NaN


In [20]:
a = df['three'].mean()
print a

20.0


In [21]:
df['three'] = df['three'].fillna(value=a) #assign the result back instead of relying on inplace=True on a selected column

In [22]:
df

   one  three  two  four  five
a  1.0   10.0    1    20  30.0
b  2.0   20.0    2    30  50.0
c  3.0   30.0    3    40  70.0
d  NaN   20.0    4    70   NaN


In [23]:
df=df.fillna(999)

In [39]:
df

     one  three  two  four   five
a    1.0   10.0    1    20   30.0
b    2.0   20.0    2    30   50.0
c    3.0   30.0    3    40   70.0
d  999.0   20.0    4    70  999.0


In [41]:
df10 = df #df10 is a second name for the same DataFrame object; no copy is made
print (df10)
print type(df10)

     one  three  two  four   five
a    1.0   10.0    1    20   30.0
b    2.0   20.0    2    30   50.0
c    3.0   30.0    3    40   70.0
d  999.0   20.0    4    70  999.0
<class 'pandas.core.frame.DataFrame'>


In [42]:
df['six']=df['five']+df['four']
print (df)

     one  three  two  four   five     six
a    1.0   10.0    1    20   30.0    50.0
b    2.0   20.0    2    30   50.0    80.0
c    3.0   30.0    3    40   70.0   110.0
d  999.0   20.0    4    70  999.0  1069.0


In [43]:
print (df10)

     one  three  two  four   five     six
a    1.0   10.0    1    20   30.0    50.0
b    2.0   20.0    2    30   50.0    80.0
c    3.0   30.0    3    40   70.0   110.0
d  999.0   20.0    4    70  999.0  1069.0


In [44]:
df11 = df.copy() #df11 is an independent copy; later changes to df do not affect it

In [45]:
print df11

     one  three  two  four   five     six
a    1.0   10.0    1    20   30.0    50.0
b    2.0   20.0    2    30   50.0    80.0
c    3.0   30.0    3    40   70.0   110.0
d  999.0   20.0    4    70  999.0  1069.0


In [46]:
df['seven'] = df['four']+df['five']+df['six']
print df

     one  three  two  four   five     six   seven
a    1.0   10.0    1    20   30.0    50.0   100.0
b    2.0   20.0    2    30   50.0    80.0   160.0
c    3.0   30.0    3    40   70.0   110.0   220.0
d  999.0   20.0    4    70  999.0  1069.0  2138.0


In [47]:
df11

     one  three  two  four   five     six
a    1.0   10.0    1    20   30.0    50.0
b    2.0   20.0    2    30   50.0    80.0
c    3.0   30.0    3    40   70.0   110.0
d  999.0   20.0    4    70  999.0  1069.0
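
The difference between plain assignment and copy(), as seen with df10 and df11 above, can be reduced to a few lines. The frame and names below are made up for illustration.

```python
import pandas as pd

df_a = pd.DataFrame({'x': [1, 2]})    # made-up frame

alias = df_a                # second name for the same object
snapshot = df_a.copy()      # independent copy of the data

df_a['y'] = df_a['x'] * 10  # add a column to the original

print(alias is df_a)               # True
print(list(alias.columns))         # ['x', 'y']
print(list(snapshot.columns))      # ['x']
```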


In [53]:
abc = pd.DataFrame({'eight' : pd.Series([10,20,30], index=['a','b','c'])})
print type(abc)
print abc
print type(df)
print df
print ("Combining Dataframes")
#combining two dataframes
#test = abc+df
#print test
#print type(test)
f = [abc,df]
type(f)
new_df = pd.concat(f)
print (new_df)

new_df1 = pd.concat(f, axis=1)
print new_df1

#Reference: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

<class 'pandas.core.frame.DataFrame'>
   eight
a     10
b     20
c     30
<class 'pandas.core.frame.DataFrame'>
     one  three  two  four   five     six   seven
a    1.0   10.0    1    20   30.0    50.0   100.0
b    2.0   20.0    2    30   50.0    80.0   160.0
c    3.0   30.0    3    40   70.0   110.0   220.0
d  999.0   20.0    4    70  999.0  1069.0  2138.0
Combining Dataframes
   eight   five  four    one   seven     six  three  two
a   10.0    NaN   NaN    NaN     NaN     NaN    NaN  NaN
b   20.0    NaN   NaN    NaN     NaN     NaN    NaN  NaN
c   30.0    NaN   NaN    NaN     NaN     NaN    NaN  NaN
a    NaN   30.0  20.0    1.0   100.0    50.0   10.0  1.0
b    NaN   50.0  30.0    2.0   160.0    80.0   20.0  2.0
c    NaN   70.0  40.0    3.0   220.0   110.0   30.0  3.0
d    NaN  999.0  70.0  999.0  2138.0  1069.0   20.0  4.0
   eight    one  three  two  four   five     six   seven
a   10.0    1.0   10.0    1    20   30.0    50.0   100.0
b   20.0    2.0   20.0    2    30   50.0    80.



In [53]:
df.index

Index([u'a', u'b', u'c', u'd'], dtype='object')

In [56]:
#Extract first two rows
df.iloc[0:2, :]

   one  three  two  four  five   six  seven
a  1.0   10.0    1    20  30.0  50.0  100.0
b  2.0   20.0    2    30  50.0  80.0  160.0


In [61]:
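#Select columns by position
df.iloc[:,0:3]       #slice: the first three columns (not displayed, because it is not the last expression in the cell)
df.iloc[:,[0,2,4]]   #list: the 1st, 3rd and 5th columns (displayed below)
#df['one']           #the first column alone, selected by name

     one  two   five
a    1.0    1   30.0
b    2.0    2   50.0
c    3.0    3   70.0
d  999.0    4  999.0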


In [125]:
#Extract columns 1 and 3 (positions 0 and 2)
df.iloc[:,[0,2]]
#df[['one','two']]

     one  two
a    1.0    1
b    2.0    2
c    3.0    3
d  999.0    4


In [62]:
#Extract rows 1 and 3
df.iloc[[0,2], :]

   one  three  two  four  five    six  seven
a  1.0   10.0    1    20  30.0   50.0  100.0
c  3.0   30.0    3    40  70.0  110.0  220.0


In [66]:
df.iloc[[1], [1]]

   three
b   20.0


In [67]:
df.iloc[[1,2],[2,3]]

   two  four
b    2    30
c    3    40


In [69]:
df.iloc[1:3,2:4]

   two  four
b    2    30
c    3    40


In [123]:
df

     one  three  two  four   five     six   seven
a    1.0   10.0    1    20   30.0    50.0   100.0
b    2.0   20.0    2    30   50.0    80.0   160.0
c    3.0   30.0    3    40   70.0   110.0   220.0
d  999.0   20.0    4    70  999.0  1069.0  2138.0


In [72]:
#Print all rows where column 'three' has values >=30
df[df['three']>=30]

   one  three  two  four  five    six  seven
c  3.0   30.0    3    40  70.0  110.0  220.0


In [70]:
df.three>=30

a    False
b    False
c     True
d    False
Name: three, dtype: bool

In [73]:
df[['one','two','six']]

     one  two     six
a    1.0    1    50.0
b    2.0    2    80.0
c    3.0    3   110.0
d  999.0    4  1069.0


In [130]:
df.index.values

array(['a', 'b', 'c', 'd'], dtype=object)

In [74]:
df.loc[['a','c'],:]

   one  three  two  four  five    six  seven
a  1.0   10.0    1    20  30.0   50.0  100.0
c  3.0   30.0    3    40  70.0  110.0  220.0


In [75]:
df.loc[:,['one','three']]

     one  three
a    1.0   10.0
b    2.0   20.0
c    3.0   30.0
d  999.0   20.0


In [76]:
df

     one  three  two  four   five     six   seven
a    1.0   10.0    1    20   30.0    50.0   100.0
b    2.0   20.0    2    30   50.0    80.0   160.0
c    3.0   30.0    3    40   70.0   110.0   220.0
d  999.0   20.0    4    70  999.0  1069.0  2138.0


In [77]:
df.loc[['a','c'],['one','two']]

   one  two
a  1.0    1
c  3.0    3


In [79]:
#Adding a row at the end
df.loc['e'] = [5, 89, 93,8,9, 15, 20] 

In [80]:
df

     one  three  two  four   five     six   seven
a    1.0   10.0    1    20   30.0    50.0   100.0
b    2.0   20.0    2    30   50.0    80.0   160.0
c    3.0   30.0    3    40   70.0   110.0   220.0
d  999.0   20.0    4    70  999.0  1069.0  2138.0
e    5.0   89.0   93     8    9.0    15.0    20.0


In [81]:
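#Row to append, given as a dict of column name -> value
#(DataFrame.append was removed in pandas 2.0; with newer versions use pd.concat instead)
new_row = {'one': 5, 'two': 93, 'three': 89, 'four': 8, 'five':9, 'six': 15, "seven": 20}
df = df.append(new_row, ignore_index = True) #ignore_index=True replaces the labels a..e with 0..5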

In [82]:
df

     one  three  two  four   five     six   seven
0    1.0   10.0    1    20   30.0    50.0   100.0
1    2.0   20.0    2    30   50.0    80.0   160.0
2    3.0   30.0    3    40   70.0   110.0   220.0
3  999.0   20.0    4    70  999.0  1069.0  2138.0
4    5.0   89.0   93     8    9.0    15.0    20.0
5    5.0   89.0   93     8    9.0    15.0    20.0
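
On pandas 2.0 and later, where DataFrame.append no longer exists, the same row can be added with pd.concat. A sketch on a made-up two-column frame (df_rows and new_row are illustrative names):

```python
import pandas as pd

df_rows = pd.DataFrame({'one': [1.0, 2.0], 'two': [1, 2]})   # made-up frame
new_row = {'one': 5.0, 'two': 93}

# Wrap the dict in a one-row DataFrame, then concatenate row-wise
df_rows = pd.concat([df_rows, pd.DataFrame([new_row])], ignore_index=True)

print(df_rows)
```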


In [83]:
df.index = ['a','b','c','d','e','f']

In [84]:
df

     one  three  two  four   five     six   seven
a    1.0   10.0    1    20   30.0    50.0   100.0
b    2.0   20.0    2    30   50.0    80.0   160.0
c    3.0   30.0    3    40   70.0   110.0   220.0
d  999.0   20.0    4    70  999.0  1069.0  2138.0
e    5.0   89.0   93     8    9.0    15.0    20.0
f    5.0   89.0   93     8    9.0    15.0    20.0


In [85]:
df.columns

Index([u'one', u'three', u'two', u'four', u'five', u'six', u'seven'], dtype='object')

In [86]:
df.columns.values

array(['one', 'three', 'two', 'four', 'five', 'six', 'seven'],
      dtype=object)

In [88]:
# Change the column names
df.columns =['Col_1', 'Col_2','Col_3','Col_4','Col_5', 'Col_6',"Col_7"]  

In [89]:
df

   Col_1  Col_2  Col_3  Col_4  Col_5   Col_6   Col_7
a    1.0   10.0      1     20   30.0    50.0   100.0
b    2.0   20.0      2     30   50.0    80.0   160.0
c    3.0   30.0      3     40   70.0   110.0   220.0
d  999.0   20.0      4     70  999.0  1069.0  2138.0
e    5.0   89.0     93      8    9.0    15.0    20.0
f    5.0   89.0     93      8    9.0    15.0    20.0


In [92]:
df.duplicated(keep='last')

a    False
b    False
c    False
d    False
e     True
f    False
dtype: bool

In [93]:
df[df.duplicated(keep = 'last')] 

   Col_1  Col_2  Col_3  Col_4  Col_5  Col_6  Col_7
e    5.0   89.0     93      8    9.0   15.0   20.0


In [94]:
df = df.drop_duplicates(subset = None, keep ='first')

In [95]:
df

   Col_1  Col_2  Col_3  Col_4  Col_5   Col_6   Col_7
a    1.0   10.0      1     20   30.0    50.0   100.0
b    2.0   20.0      2     30   50.0    80.0   160.0
c    3.0   30.0      3     40   70.0   110.0   220.0
d  999.0   20.0      4     70  999.0  1069.0  2138.0
e    5.0   89.0     93      8    9.0    15.0    20.0


In [97]:
#Rows where the value in Col_2 (formerly column 'three') is not equal to 20; the two forms are equivalent
df[df['Col_2']!=20]
df[~(df['Col_2']==20)]

   Col_1  Col_2  Col_3  Col_4  Col_5  Col_6  Col_7
a    1.0   10.0      1     20   30.0   50.0  100.0
c    3.0   30.0      3     40   70.0  110.0  220.0
e    5.0   89.0     93      8    9.0   15.0   20.0


loc gets rows (or columns) with particular labels from the index.

iloc gets rows (or columns) at particular positions in the index (so it only takes integers).

Source: https://www.pythonprogramming.in/what-is-difference-between-iloc-and-loc-in-pandas.html
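
A compact sketch of the distinction, using a made-up frame. It also shows one detail that is easy to trip over: a label slice with loc includes its end point, while a position slice with iloc excludes it.

```python
import pandas as pd

people = pd.DataFrame({'score': [4.6, 8.3, 9.0]},
                      index=['Jane', 'Nick', 'Aaron'])   # made-up frame

by_label = people.loc['Nick', 'score']     # row label, column label
by_position = people.iloc[1, 0]            # row position, column position
print(by_label)                            # 8.3
print(by_position)                         # 8.3

# Slices differ: loc includes the end label, iloc excludes the end position
label_slice = people.loc['Jane':'Nick']    # 2 rows
position_slice = people.iloc[0:1]          # 1 row
print(len(label_slice))
print(len(position_slice))
```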

In [88]:
import pandas as pd
 
df1 = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39],
                   'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black',
                             'Red'],
                   'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese',
                            'Melon', 'Beans'],
                   'Height': [165, 70, 120, 80, 180, 172, 150],
                   'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
                         'Christina', 'Cornelia'])
 
print("\n -- loc -- \n")
print(df1.loc[df1['Age'] < 30, ['Color', 'Height']])
 
print("\n -- iloc -- \n")
print(df1.iloc[(df1['Age'] < 30).values, [1, 3]])


 -- loc -- 

           Color  Height
Nick       Green      70
Aaron        Red     120
Christina  Black     172

 -- iloc -- 

           Color  Height
Nick       Green      70
Aaron        Red     120
Christina  Black     172


In [89]:
(df1['Age'] < 30).values

array([False,  True,  True, False, False,  True, False])

In [90]:
df1['Age'] < 30

Jane         False
Nick          True
Aaron         True
Penelope     False
Dean         False
Christina     True
Cornelia     False
Name: Age, dtype: bool

In [55]:
#Accessing Excel Files using Pandas
import pandas as pd
data = pd.read_excel("C://Users//user//Desktop//FORE Documents//Courses//Big Data Analytics for Managers (Python)//Session 3- 6//archive//Sample Data.xlsx")

print data
print type(data)
print data['Name']
print ("Mean Age",data['Age'].mean())
print ("Median height",data['Height (cms)'].median())
print ("Players whose name starts with R",data['Name'][data['Name'].str.startswith("R")])

      Name  Age  Height (cms) Gender
0    Virat   32           162   Male
1    Rohit   32           162   Male
2  Shikhar   33           160   Male
3   Rahane   33           159   Male
4   Pujara   33           161   Male
<class 'pandas.core.frame.DataFrame'>
0      Virat
1      Rohit
2    Shikhar
3     Rahane
4     Pujara
Name: Name, dtype: object
('Mean Age', 32.6)
('Median height', 161.0)
('Players whose name starts with R', 1     Rohit
3    Rahane
Name: Name, dtype: object)


In [60]:
# reading csv files using pandas
data = pd.read_csv("C://Users//user//Desktop//FORE Documents//Courses//Big Data Analytics for Managers (Python)/Session 3- 6//archive//bank_cleaned.csv")
data.head() #First five rows
data.tail() #last five rows

Unnamed: 0.1,Unnamed: 0,age,job,marital,education,default,balance,housing,loan,day,month,duration,campaign,pdays,previous,poutcome,response,response_binary
40836,45205,25,technician,single,secondary,no,505,no,yes,17,nov,6.43,2,-1,0,unknown,yes,1
40837,45206,51,technician,married,tertiary,no,825,no,no,17,nov,16.28,3,-1,0,unknown,yes,1
40838,45207,71,retired,divorced,primary,no,1729,no,no,17,nov,7.6,2,-1,0,unknown,yes,1
40839,45208,72,retired,married,secondary,no,5715,no,no,17,nov,18.78,5,184,3,success,yes,1
40840,45209,57,blue-collar,married,secondary,no,668,no,no,17,nov,8.47,4,-1,0,unknown,no,0


In [98]:
# Group_by operation
#Source: 

import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

In [99]:
df

   Points  Rank    Team  Year
0     876     1  Riders  2014
1     789     2  Riders  2015
2     863     2  Devils  2014
3     673     3  Devils  2015
4     741     3   Kings  2014
5     812     4   kings  2015
6     756     1   Kings  2016
7     788     1   Kings  2017
8     694     2  Riders  2016
9     701     4  Royals  2014


In [100]:
print df.groupby('Team')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000000000B67DEB8>


In [175]:
df.groupby('Team').groups

{'Devils': Int64Index([2, 3], dtype='int64'),
 'Kings': Int64Index([4, 6, 7], dtype='int64'),
 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'),
 'Royals': Int64Index([9, 10], dtype='int64'),
 'kings': Int64Index([5], dtype='int64')}

In [101]:
df.groupby(['Team','Year']).groups

{('Devils', 2014L): Int64Index([2], dtype='int64'),
 ('Devils', 2015L): Int64Index([3], dtype='int64'),
 ('Kings', 2014L): Int64Index([4], dtype='int64'),
 ('Kings', 2016L): Int64Index([6], dtype='int64'),
 ('Kings', 2017L): Int64Index([7], dtype='int64'),
 ('Riders', 2014L): Int64Index([0], dtype='int64'),
 ('Riders', 2015L): Int64Index([1], dtype='int64'),
 ('Riders', 2016L): Int64Index([8], dtype='int64'),
 ('Riders', 2017L): Int64Index([11], dtype='int64'),
 ('Royals', 2014L): Int64Index([9], dtype='int64'),
 ('Royals', 2015L): Int64Index([10], dtype='int64'),
 ('kings', 2015L): Int64Index([5], dtype='int64')}

In [102]:
import numpy as np
df.groupby('Year').agg(np.mean)

      Points  Rank
Year
2014  795.25   2.5
2015   769.5   2.5
2016   725.0   1.5
2017   739.0   1.5


In [180]:
df.groupby('Team').agg(np.mean)

            Points      Rank         Year
Team
Devils       768.0       2.5       2014.5
Kings   761.666667  1.666667  2015.666667
Riders      762.25      1.75       2015.5
Royals       752.5       2.5       2014.5
kings        812.0       4.0       2015.0


In [181]:
df.groupby('Team')['Points'].agg(np.mean)

Team
Devils    768.000000
Kings     761.666667
Riders    762.250000
Royals    752.500000
kings     812.000000
Name: Points, dtype: float64
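
The outputs above list 'Kings' and 'kings' as separate groups, because groupby compares labels exactly. The sketch below uses a four-row subset of the data and assumes the lowercase 'kings' is a typo rather than a different team. It also passes the aggregation by name ('mean'), which works without importing NumPy.

```python
import pandas as pd

# Four rows taken from the data above
scores = pd.DataFrame({'Team': ['Riders', 'Riders', 'Kings', 'kings'],
                       'Points': [876, 789, 741, 812]})

# Exact matching: 'Kings' and 'kings' form two groups
raw_means = scores.groupby('Team')['Points'].agg('mean')
print(raw_means)

# Normalising the case first merges them (assumption: 'kings' is a typo)
scores['Team'] = scores['Team'].str.title()
by_team = scores.groupby('Team')['Points'].agg(['mean', 'count'])
print(by_team)
```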