In [0]:
import pandas as pd
import numpy as np

In [8]:
s = pd.Series(np.arange(100, 110), index= np.arange(10,20))
s

10    100
11    101
12    102
13    103
14    104
15    105
16    106
17    107
18    108
19    109
dtype: int64

In [9]:
s[1:6]

11    101
12    102
13    103
14    104
15    105
dtype: int64

In [10]:
s.iloc[1:6]

11    101
12    102
13    103
14    104
15    105
dtype: int64

In [11]:
s[1:6:2]

11    101
13    103
15    105
dtype: int64

Use of a negative step value will reverse the result. The following demonstrates how to reverse the Series:

In [20]:
s[::-1]

19    109
18    108
17    107
16    106
15    105
14    104
13    103
12    102
11    101
10    100
dtype: int64

A negative stop value counts from the end of the Series, so s[:-1] returns every row except the last:

In [13]:
s[:-1]

10    100
11    101
12    102
13    103
14    104
15    105
16    106
17    107
18    108
dtype: int64

To subset a Series using a negative start and a negative stop, keep in mind that with the default positive step the start must refer to an earlier position than the stop. For example, -5 comes before -3.

The reason is that slicing begins at the start position and moves forward one step at a time until it reaches the stop, which is excluded. If the start already lies after the stop, there is nothing to traverse and the result is empty.

In [15]:
s[-5:-3]

15    105
16    106
dtype: int64

With negative positions and a positive step, the rows are returned in their original order:

In [16]:
s[-8:-4:2]

12    102
14    104
dtype: int64

However, if we use a negative step here we get an empty Series. A negative step makes the slice move backward from the start position, and position -8 already lies before position -4, so there is nothing to traverse:

In [17]:
s[-8:-4:-2]

Series([], dtype: int64)

If we interchange the start and stop, we get a non-empty Series in reverse order:

In [21]:
s[-4:-8:-2]

16    106
14    104
dtype: int64

In [19]:
s[-2::-2]

18    108
16    106
14    104
12    102
10    100
dtype: int64
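
The start, stop, and step rules above are the same ones Python applies to lists. As a minimal sketch (the names values, forward, empty, and backward are introduced here only for illustration), slice.indices shows which positions a negative slice resolves to for a Series of length 10:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100, 110), index=np.arange(10, 20))
values = s.tolist()

# Resolve negative bounds to concrete positions for a length-10 sequence.
print(slice(-8, -4, 2).indices(len(s)))   # (2, 6, 2)
print(slice(-8, -4, -2).indices(len(s)))  # (2, 6, -2): stepping backward from 2 never reaches 6
print(slice(-4, -8, -2).indices(len(s)))  # (6, 2, -2)

forward = s.iloc[-8:-4:2]     # positions 2 and 4
empty = s.iloc[-8:-4:-2]      # nothing to traverse
backward = s.iloc[-4:-8:-2]   # positions 6 and 4

# Positional slicing of the Series agrees with slicing a plain list.
print(forward.tolist() == values[-8:-4:2])    # True
print(backward.tolist() == values[-4:-8:-2])  # True
print(empty.empty)                            # True
```

Using .iloc here makes the positional intent explicit, which helps when the index itself holds integers.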

It is also possible to slice a series with a non-integer index. To demonstrate, let's use the following Series:

In [23]:
s = pd.Series(np.arange(0,5), index= list('abcde'))
s

a    0
b    1
c    2
d    3
e    4
dtype: int64

In [24]:
s[1:3]

b    1
c    2
dtype: int64

In [29]:
s.loc['b':'e':2]

b    1
d    3
dtype: int64
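
One difference between the two cells above is worth spelling out: a positional slice excludes its stop, whereas a label slice with .loc includes the stop label. A short sketch, with by_position and by_label as illustrative names:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(0, 5), index=list('abcde'))

by_position = s.iloc[1:3]   # positions 1 and 2; the stop position 3 is excluded
by_label = s.loc['b':'d']   # labels b, c and d; the stop label 'd' is included

print(by_position.index.tolist())  # ['b', 'c']
print(by_label.index.tolist())     # ['b', 'c', 'd']
```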

# Alignment Via Index Labels  

Because every Series carries an index, pandas aligns values by index label whenever an arithmetic or logical operation is applied to two Series. Values that share the same label in both Series are automatically matched with each other:

In [0]:
# Two Series that share the same index labels
s1 = pd.Series([33,5], index= ['a','b'])
s2 = pd.Series([33,5], index= ['a','b'])

In [31]:
s1 + s2

a    66
b    10
dtype: int64
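
Alignment depends on the labels, not on the order in which the values are stored. The following sketch uses two hypothetical Series, left and right, whose labels are the same but appear in a different order, and contrasts the label-aligned sum with a purely positional NumPy sum:

```python
import pandas as pd

left = pd.Series([1, 2], index=['a', 'b'])
right = pd.Series([10, 20], index=['b', 'a'])  # same labels, different order

# pandas matches 'a' with 'a' and 'b' with 'b' before adding.
total = left + right
print(total['a'])  # 21
print(total['b'])  # 12

# NumPy arrays have no labels, so the sum is element by element.
positional = left.to_numpy() + right.to_numpy()
print(positional.tolist())  # [11, 22]
```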

It is also possible to apply a scalar value to a Series. The result will be that the scalar will be applied to each value in the Series using the specified operation:

In [33]:
s1 * 2

a    66
b    10
dtype: int64

The same result can be obtained by multiplying by a Series that holds the scalar at every label of s1:

In [34]:
t = pd.Series(2, index = s1.index)
t

a    2
b    2
dtype: int64

In [35]:
s1 * t

a    66
b    10
dtype: int64

The NaN value is, by default, the result of any pandas alignment where an index label does not align with the other Series. This is an important characteristic of pandas when compared to NumPy: if labels do not align, no exception is thrown. This helps when some data is missing but it is acceptable for this to happen. Processing continues, but pandas lets you know there's an issue (but not necessarily a problem) by returning NaN.

In [0]:
s4 = pd.Series([3,5], index= ['c','b'])

In [37]:
s1 + s4

a     NaN
b    10.0
c     NaN
dtype: float64
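
When a missing label should be treated as a known value rather than produce NaN, one option is the add method with its fill_value parameter. This is an alternative to the + operator used above, shown here as a sketch with the same values as s1 and s4; the name filled is illustrative:

```python
import pandas as pd

s1 = pd.Series([33, 5], index=['a', 'b'])
s4 = pd.Series([3, 5], index=['c', 'b'])

# A label missing from one Series is treated as 0 before the addition.
filled = s1.add(s4, fill_value=0)
print(filled['a'])  # 33.0
print(filled['b'])  # 10.0
print(filled['c'])  # 3.0
```

Whether 0 is a sensible stand-in depends on the data; for many datasets leaving the NaN in place is the more honest result.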

Labels in a pandas index do not need to be unique. The alignment operation actually forms a Cartesian product of the matching labels in the two Series. If there are n 'a' labels in series 1 and m 'a' labels in series 2, then the result will have n*m rows labelled 'a'.

In [0]:
s1 = pd.Series([3,4,5,6], index= ['a','a','b','c'])
s2 = pd.Series([1,3,2,2], index= ['a','b','b','c'])

In [45]:
s1 * s2

a     3
a     4
b    15
b    10
c    12
dtype: int64
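
The row count of the result can be checked against the n*m rule label by label. The sketch below reuses the two Series above and counts labels with value_counts; expected_rows and actual_rows are names introduced only for this check:

```python
import pandas as pd

s1 = pd.Series([3, 4, 5, 6], index=['a', 'a', 'b', 'c'])
s2 = pd.Series([1, 3, 2, 2], index=['a', 'b', 'b', 'c'])

product = s1 * s2

# Rows expected per label: (occurrences in s1) * (occurrences in s2).
expected_rows = s1.index.value_counts() * s2.index.value_counts()
actual_rows = product.index.value_counts()

print(expected_rows.sort_index().to_dict())  # {'a': 2, 'b': 2, 'c': 1}
print(actual_rows.sort_index().to_dict())    # {'a': 2, 'b': 2, 'c': 1}
print(len(product))                          # 5
```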

# Index subsetting 

In [47]:
s1[ s1 >2]

a    3
a    4
b    5
c    6
dtype: int64

In [48]:
s1 > 4

a    False
a    False
b     True
c     True
dtype: bool

In [49]:
s1[(s1 > 3) & (s1 < 6)]

a    4
b    5
dtype: int64

In [50]:
s1[s1 >3].all()

True

In [51]:
s1[s1> 8].any()

False

In [52]:
(s <4).sum()

4
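
All of the selections above rely on a Boolean Series that is aligned with the data and acts as a mask. As a small sketch (mask, selected, and rest are illustrative names), the mask can be stored, counted, and inverted:

```python
import pandas as pd

s1 = pd.Series([3, 4, 5, 6], index=['a', 'a', 'b', 'c'])

# Each comparison yields a Boolean Series; & combines them element by element.
mask = (s1 > 3) & (s1 < 6)

selected = s1[mask]      # rows where the mask is True
rest = s1[~mask]         # ~ inverts the mask
count = int(mask.sum())  # True counts as 1, so the sum is the number of matches

print(selected.tolist())  # [4, 5]
print(rest.tolist())      # [3, 6]
print(count)              # 2
```

The parentheses around each comparison are required because & binds more tightly than > and <.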

# ReIndexing

In [0]:
s1 = pd.Series([1,2,3,4,5,6], index= ['a','b','c','d','e','f'])

In [56]:
s1.reindex(index= [1,2,3,4,5,6])

1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
6   NaN
dtype: float64
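
The result above is all NaN because reindex selects by label, and none of the labels 1 to 6 exist in s1. The following sketch shows reindex with a mix of existing and new labels, the fill_value parameter, and, for contrast, plain relabelling by assigning to the index. The label 'z' and the names subset, filled, and relabelled are assumptions made for illustration:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])

# Existing labels keep their values; the unknown label 'z' becomes NaN.
subset = s1.reindex(['a', 'c', 'z'])
print(subset.tolist()[:2])      # [1.0, 3.0]
print(subset.isna().tolist())   # [False, False, True]

# fill_value supplies a value for labels that are not found.
filled = s1.reindex(['a', 'c', 'z'], fill_value=0)
print(filled.tolist())          # [1, 3, 0]

# To rename the labels while keeping the data, assign a new index instead.
relabelled = s1.copy()
relabelled.index = [1, 2, 3, 4, 5, 6]
print(relabelled.tolist())      # [1, 2, 3, 4, 5, 6]
```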

# DataFrame

There are a number of ways to create a data frame. A data frame can be created from either a single or multi-dimensional set of data. The techniques that we will examine are as follows:

* Using the results of NumPy functions
* Using data from a Python dictionary consisting of lists or pandas Series objects
* Using data from a CSV file

While examining each of these, we will also examine how to specify column names, demonstrate how alignment is performed during initialization, and see how to determine the dimensions of a DataFrame.

In [57]:
pd.DataFrame(np.arange(1,5))

   0
0  1
1  2
2  3
3  4


The first column of the output shows the labels of the index that was created. Since an index was not specified at the time of creation, pandas created a RangeIndex with labels starting at 0.

In [0]:
df = pd.DataFrame(np.array([[3,4], [5,6]]))

In [60]:
df

   0  1
0  3  4
1  5  6


In [61]:
df.columns

RangeIndex(start=0, stop=2, step=1)

In [0]:
df = pd.DataFrame(np.array([[3,4], [5,6]]), columns = ['A', 'B'])

In [63]:
df

   A  B
0  3  4
1  5  6


In [64]:
len(df)

2

In [65]:
df.shape

(2, 2)


A DataFrame can also be created from a Python dictionary whose values are lists or pandas Series objects. Each key becomes a column name:


In [0]:
a = [3,4,5,6]
b = [5,6,7,8]

dicc = {'T1': a ,
       'T2' : b }

df1 = pd.DataFrame(dicc)

In [69]:
df1

   T1  T2
0   3   5
1   4   6
2   5   7
3   6   8
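
When the dictionary values are Series rather than lists, pandas aligns them by index label while building the DataFrame. A minimal sketch, assuming two hypothetical Series t1 and t2 with partly overlapping labels:

```python
import pandas as pd

t1 = pd.Series([3, 4, 5], index=['x', 'y', 'z'])
t2 = pd.Series([7, 8], index=['y', 'z'])   # no value for label 'x'

# The row index is the union of the labels; missing entries become NaN.
aligned = pd.DataFrame({'T1': t1, 'T2': t2})

print(aligned.shape)                   # (3, 2)
print(aligned['T1'].tolist())          # [3, 4, 5]
print(aligned['T2'].isna().tolist())   # [True, False, False]
```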


A common technique of creating a DataFrame is by using a list of pandas Series objects that will be used as the rows:

In [72]:
a = pd.Series([2,3,4,5,6])
b = pd.Series([9,7,6,5,4])

df2 = pd.DataFrame([a, b])
df2

   0  1  2  3  4
0  2  3  4  5  6
1  9  7  6  5  4


In [73]:
a = pd.Series([2,3,4,5,6])
b = pd.Series([9,7,6,5,4,5])

df2 = pd.DataFrame([a, b])
df2

     0    1    2    3    4    5
0  2.0  3.0  4.0  5.0  6.0  NaN
1  9.0  7.0  6.0  5.0  4.0  5.0


In [77]:
df2.columns = ['a','b','c','d','e','f']
df2

     a    b    c    d    e    f
0  2.0  3.0  4.0  5.0  6.0  NaN
1  9.0  7.0  6.0  5.0  4.0  5.0


### Creating a DataFrame from a CSV file

In [0]:
df = pd.read_csv('sample_data/file(2).csv', index_col = 'Loan_ID' )

In [19]:
df.head(5)

Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
LP001002,Male,No,0,Graduate,No,5849,0.0,NaN,360.0,1.0,Urban,Y
LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
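
The effect of index_col can be tried without the file itself. The sketch below is an assumption-laden stand-in: it reads a three-row, four-column excerpt of the data shown above from an in-memory string instead of from sample_data/file(2).csv, and the names csv_text and loans are illustrative:

```python
import io

import pandas as pd

# A small excerpt of the loan data; the empty last field becomes NaN.
csv_text = """Loan_ID,ApplicantIncome,CoapplicantIncome,LoanAmount
LP001002,5849,0.0,
LP001003,4583,1508.0,128.0
LP001005,3000,0.0,66.0
"""

# index_col turns the named column into the row index instead of a data column.
loans = pd.read_csv(io.StringIO(csv_text), index_col='Loan_ID')

print(loans.index.name)                            # Loan_ID
print(loans.shape)                                 # (3, 3)
print(loans.loc['LP001003', 'LoanAmount'])         # 128.0
print(pd.isna(loans.loc['LP001002', 'LoanAmount']))  # True
```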


### Replacing the column values 

In [0]:
df_copy = df.copy()
df_copy.ApplicantIncome = df_copy.ApplicantIncome.fillna(value = 0)

In [21]:
df_copy.head(5)

Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
LP001002,Male,No,0,Graduate,No,5849,0.0,NaN,360.0,1.0,Urban,Y
LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [0]:
df_copy.loc[:,'Loan_to_income'] = df_copy.LoanAmount / df_copy.ApplicantIncome

In [90]:
df_copy.head(5)

Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,Loan_to_income
LP001002,Male,No,0,Graduate,No,5849,0.0,NaN,360.0,1.0,Urban,Y,NaN
LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N,0.027929
LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y,0.022
LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y,0.046458
LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y,0.0235
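
The derived column can also be added without modifying the DataFrame in place. As a sketch, DataFrame.assign returns a new DataFrame with the extra column; the small loans frame below is a hand-built excerpt of the data above, and the name with_ratio is illustrative:

```python
import numpy as np
import pandas as pd

loans = pd.DataFrame(
    {'ApplicantIncome': [5849, 4583, 3000],
     'LoanAmount': [np.nan, 128.0, 66.0]},
    index=pd.Index(['LP001002', 'LP001003', 'LP001005'], name='Loan_ID'),
)

# assign leaves loans untouched and returns a copy with the new column.
with_ratio = loans.assign(Loan_to_income=loans.LoanAmount / loans.ApplicantIncome)

print('Loan_to_income' in loans.columns)                          # False
print(round(with_ratio.loc['LP001003', 'Loan_to_income'], 6))     # 0.027929
print(pd.isna(with_ratio.loc['LP001002', 'Loan_to_income']))      # True
```

The NaN in the first row follows from the missing LoanAmount: any arithmetic involving NaN yields NaN.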


### Deleting columns 

In [0]:
del df_copy['Loan_to_income']

In [14]:
df_copy.head(5)

Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
LP001002,Male,No,0,Graduate,No,5849,NaN,360.0,1.0,Urban,Y
LP001003,Male,Yes,1,Graduate,No,4583,128.0,360.0,1.0,Rural,N
LP001005,Male,Yes,0,Graduate,Yes,3000,66.0,360.0,1.0,Urban,Y
LP001006,Male,Yes,0,Not Graduate,No,2583,120.0,360.0,1.0,Urban,Y
LP001008,Male,No,0,Graduate,No,6000,141.0,360.0,1.0,Urban,Y


In [0]:
df_copy.loc[:,'Loan_to_income'] = df_copy.LoanAmount / df_copy.ApplicantIncome

In [22]:
df_copy.pop('CoapplicantIncome')
df_copy.head(5)

Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
LP001002,Male,No,0,Graduate,No,5849,NaN,360.0,1.0,Urban,Y
LP001003,Male,Yes,1,Graduate,No,4583,128.0,360.0,1.0,Rural,N
LP001005,Male,Yes,0,Graduate,Yes,3000,66.0,360.0,1.0,Urban,Y
LP001006,Male,Yes,0,Not Graduate,No,2583,120.0,360.0,1.0,Urban,Y
LP001008,Male,No,0,Graduate,No,6000,141.0,360.0,1.0,Urban,Y


If we assign the result of pop() to a variable, the removed column is returned as a Series and stored in the variable on the left-hand side, and the column is dropped from the original DataFrame as well:

In [0]:
aftercopy = df.pop('CoapplicantIncome')

In [25]:
aftercopy.head(5)

Loan_ID
LP001002       0.0
LP001003    1508.0
LP001005       0.0
LP001006    2358.0
LP001008       0.0
Name: CoapplicantIncome, dtype: float64
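
Because pop() modifies the DataFrame it is called on, a copy taken beforehand is unaffected. The sketch below illustrates this on a small hand-built frame; loans, backup, and popped are illustrative names, and the values are taken from the rows shown above:

```python
import pandas as pd

loans = pd.DataFrame(
    {'ApplicantIncome': [5849, 4583, 3000],
     'CoapplicantIncome': [0.0, 1508.0, 0.0]},
    index=pd.Index(['LP001002', 'LP001003', 'LP001005'], name='Loan_ID'),
)
backup = loans.copy()   # independent copy taken before the column is removed

popped = loans.pop('CoapplicantIncome')   # returns the column as a Series

print(type(popped).__name__)    # Series
print(popped.tolist())          # [0.0, 1508.0, 0.0]
print(list(loans.columns))      # ['ApplicantIncome']
print(list(backup.columns))     # ['ApplicantIncome', 'CoapplicantIncome']
```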

A third way of deleting a column is the drop() method of a DataFrame. By default it returns a new DataFrame and leaves the original untouched; it also has an inplace option that modifies the original DataFrame instead:

In [0]:
afterdrop = df_copy.drop('ApplicantIncome', axis = 1)

In [27]:
afterdrop.head(5)

Loan_ID,Gender,Married,Dependents,Education,Self_Employed,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
LP001002,Male,No,0,Graduate,No,NaN,360.0,1.0,Urban,Y
LP001003,Male,Yes,1,Graduate,No,128.0,360.0,1.0,Rural,N
LP001005,Male,Yes,0,Graduate,Yes,66.0,360.0,1.0,Urban,Y
LP001006,Male,Yes,0,Not Graduate,No,120.0,360.0,1.0,Urban,Y
LP001008,Male,No,0,Graduate,No,141.0,360.0,1.0,Urban,Y
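
The difference between the default behaviour of drop() and its inplace option can be sketched on a small hand-built frame. The names loans, kept, same, and result are illustrative, and the columns keyword is shown as an equivalent to axis = 1:

```python
import numpy as np
import pandas as pd

loans = pd.DataFrame(
    {'ApplicantIncome': [5849, 4583, 3000],
     'LoanAmount': [np.nan, 128.0, 66.0]},
    index=pd.Index(['LP001002', 'LP001003', 'LP001005'], name='Loan_ID'),
)

# Default: a new DataFrame is returned and loans keeps all of its columns.
kept = loans.drop('ApplicantIncome', axis=1)
same = loans.drop(columns=['ApplicantIncome'])   # equivalent spelling
print(list(kept.columns))    # ['LoanAmount']
print(list(loans.columns))   # ['ApplicantIncome', 'LoanAmount']

# inplace=True modifies loans itself and returns None.
result = loans.drop('ApplicantIncome', axis=1, inplace=True)
print(result)                # None
print(list(loans.columns))   # ['LoanAmount']
```

Assigning the returned DataFrame to a new name, as in the cell above, is often easier to follow than inplace=True because the original stays available for comparison.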
