# pandas DataFrames

We will demonstrate the basics of pandas DataFrames in this notebook. A DataFrame is a two-dimensional object whose columns are pandas Series, so all of the Series properties and methods can be applied to DataFrame columns.

In [1]:
import pandas as pd

## Inputting Text Files with pandas

The pd.read_csv() function reads a csv file into a DataFrame. By default, the first row of the file is taken to contain the names of the columns.

In the cell below we read the file that contains the Iris data we worked with previously and assign the result to the variable df. Naming the variable df, or giving it a name that starts with df, is a helpful reminder that the variable holds a DataFrame.

In [2]:
df = pd.read_csv('Iris.csv')
print('Type of df variable: ',type(df))
print('Column Labels: ',df.columns.values)

Type of df variable:  <class 'pandas.core.frame.DataFrame'>
Column Labels:  ['Id' 'SepalLengthCm' 'SepalWidthCm' 'PetalLengthCm' 'PetalWidthCm'
 'Species']
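To try the header behavior without the Iris.csv file at hand, here is a minimal sketch that reads csv text from memory. The three sample rows and the names csv_text, df_sample, and df_no_header are assumptions made for illustration; only the column layout is taken from the output above.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same layout as Iris.csv
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
"""

# By default the column labels are taken from the first row
df_sample = pd.read_csv(io.StringIO(csv_text))
print(df_sample.columns.tolist())
print(df_sample.shape)

# header=None treats every row as data and labels the columns 0, 1, 2, ...
df_no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(df_no_header.shape)
```

With header=None the row of names is counted as data, so the second DataFrame has one more row than the first.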


## Accessing DataFrame Elements

In [3]:
df['PetalLengthCm']

0      1.4
1      1.4
2      1.3
3      1.5
4      1.4
5      1.7
6      1.4
7      1.5
8      1.4
9      1.5
10     1.5
11     1.6
12     1.4
13     1.1
14     1.2
15     1.5
16     1.3
17     1.4
18     1.7
19     1.5
20     1.7
21     1.5
22     1.0
23     1.7
24     1.9
25     1.6
26     1.6
27     1.5
28     1.4
29     1.6
      ... 
120    5.7
121    4.9
122    6.7
123    4.9
124    5.7
125    6.0
126    4.8
127    4.9
128    5.6
129    5.8
130    6.1
131    6.4
132    5.6
133    5.1
134    5.6
135    6.1
136    5.6
137    5.5
138    4.8
139    5.4
140    5.6
141    5.1
142    5.1
143    5.9
144    5.7
145    5.2
146    5.0
147    5.2
148    5.4
149    5.1
Name: PetalLengthCm, Length: 150, dtype: float64

In [4]:
df['PetalWidthCm']

0      0.2
1      0.2
2      0.2
3      0.2
4      0.2
5      0.4
6      0.3
7      0.2
8      0.2
9      0.1
10     0.2
11     0.2
12     0.1
13     0.1
14     0.2
15     0.4
16     0.4
17     0.3
18     0.3
19     0.3
20     0.2
21     0.4
22     0.2
23     0.5
24     0.2
25     0.2
26     0.4
27     0.2
28     0.2
29     0.2
      ... 
120    2.3
121    2.0
122    2.0
123    1.8
124    2.1
125    1.8
126    1.8
127    1.8
128    2.1
129    1.6
130    1.9
131    2.0
132    2.2
133    1.5
134    1.4
135    2.3
136    2.4
137    1.8
138    1.8
139    2.1
140    2.4
141    2.3
142    1.9
143    2.3
144    2.5
145    2.3
146    1.9
147    2.0
148    2.3
149    1.8
Name: PetalWidthCm, Length: 150, dtype: float64

In [5]:
type(df['PetalLengthCm'])

pandas.core.series.Series

In [6]:
df['PetalLengthCm'].loc[0]

1.3999999999999999

In [7]:
df['PetalLengthCm'].iloc[0]

1.3999999999999999

In [8]:
df.loc[0]

Id                         1
SepalLengthCm            5.1
SepalWidthCm             3.5
PetalLengthCm            1.4
PetalWidthCm             0.2
Species          Iris-setosa
Name: 0, dtype: object

In [9]:
df.iloc[0]

Id                         1
SepalLengthCm            5.1
SepalWidthCm             3.5
PetalLengthCm            1.4
PetalWidthCm             0.2
Species          Iris-setosa
Name: 0, dtype: object
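In the cells above .loc and .iloc return the same results because the default index labels 0, 1, 2, ... coincide with the row positions. The sketch below, which uses a small hypothetical Series with made-up index labels, shows how the two differ: .loc selects by label and .iloc selects by position.

```python
import pandas as pd

# Hypothetical Series whose index labels differ from the positions 0, 1, 2
s = pd.Series([1.4, 4.7, 6.0], index=[10, 20, 30])

first_by_label = s.loc[10]      # label 10 is the first element
first_by_position = s.iloc[0]   # position 0 is also the first element
print(first_by_label, first_by_position)

# Slices differ too: .loc includes the end label, .iloc excludes the end position
label_slice = s.loc[10:20].tolist()
position_slice = s.iloc[0:1].tolist()
print(label_slice, position_slice)
```

The distinction matters as soon as the rows are sorted or filtered, because the labels then no longer match the positions.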

## DataFrame Properties and Methods

Many of the properties and methods that we applied to pandas Series also apply to pandas DataFrames.

In [10]:
df.shape

(150, 6)

In [11]:
df.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object

In [12]:
df.values

array([[1, 5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
       [2, 4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
       [3, 4.7, 3.2, 1.3, 0.2, 'Iris-setosa'],
       [4, 4.6, 3.1, 1.5, 0.2, 'Iris-setosa'],
       [5, 5.0, 3.6, 1.4, 0.2, 'Iris-setosa'],
       [6, 5.4, 3.9, 1.7, 0.4, 'Iris-setosa'],
       [7, 4.6, 3.4, 1.4, 0.3, 'Iris-setosa'],
       [8, 5.0, 3.4, 1.5, 0.2, 'Iris-setosa'],
       [9, 4.4, 2.9, 1.4, 0.2, 'Iris-setosa'],
       [10, 4.9, 3.1, 1.5, 0.1, 'Iris-setosa'],
       [11, 5.4, 3.7, 1.5, 0.2, 'Iris-setosa'],
       [12, 4.8, 3.4, 1.6, 0.2, 'Iris-setosa'],
       [13, 4.8, 3.0, 1.4, 0.1, 'Iris-setosa'],
       [14, 4.3, 3.0, 1.1, 0.1, 'Iris-setosa'],
       [15, 5.8, 4.0, 1.2, 0.2, 'Iris-setosa'],
       [16, 5.7, 4.4, 1.5, 0.4, 'Iris-setosa'],
       [17, 5.4, 3.9, 1.3, 0.4, 'Iris-setosa'],
       [18, 5.1, 3.5, 1.4, 0.3, 'Iris-setosa'],
       [19, 5.7, 3.8, 1.7, 0.3, 'Iris-setosa'],
       [20, 5.1, 3.8, 1.5, 0.3, 'Iris-setosa'],
       [21, 5.4, 3.4, 1.7, 0.2, 'Iris-setosa'],
 

In [13]:
df.index

RangeIndex(start=0, stop=150, step=1)

In [14]:
df.columns

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [15]:
df.columns.values

array(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm',
       'PetalWidthCm', 'Species'], dtype=object)

In [16]:
df.values.tolist()

[[1, 5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
 [2, 4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
 [3, 4.7, 3.2, 1.3, 0.2, 'Iris-setosa'],
 [4, 4.6, 3.1, 1.5, 0.2, 'Iris-setosa'],
 [5, 5.0, 3.6, 1.4, 0.2, 'Iris-setosa'],
 [6, 5.4, 3.9, 1.7, 0.4, 'Iris-setosa'],
 [7, 4.6, 3.4, 1.4, 0.3, 'Iris-setosa'],
 [8, 5.0, 3.4, 1.5, 0.2, 'Iris-setosa'],
 [9, 4.4, 2.9, 1.4, 0.2, 'Iris-setosa'],
 [10, 4.9, 3.1, 1.5, 0.1, 'Iris-setosa'],
 [11, 5.4, 3.7, 1.5, 0.2, 'Iris-setosa'],
 [12, 4.8, 3.4, 1.6, 0.2, 'Iris-setosa'],
 [13, 4.8, 3.0, 1.4, 0.1, 'Iris-setosa'],
 [14, 4.3, 3.0, 1.1, 0.1, 'Iris-setosa'],
 [15, 5.8, 4.0, 1.2, 0.2, 'Iris-setosa'],
 [16, 5.7, 4.4, 1.5, 0.4, 'Iris-setosa'],
 [17, 5.4, 3.9, 1.3, 0.4, 'Iris-setosa'],
 [18, 5.1, 3.5, 1.4, 0.3, 'Iris-setosa'],
 [19, 5.7, 3.8, 1.7, 0.3, 'Iris-setosa'],
 [20, 5.1, 3.8, 1.5, 0.3, 'Iris-setosa'],
 [21, 5.4, 3.4, 1.7, 0.2, 'Iris-setosa'],
 [22, 5.1, 3.7, 1.5, 0.4, 'Iris-setosa'],
 [23, 4.6, 3.6, 1.0, 0.2, 'Iris-setosa'],
 [24, 5.1, 3.3, 1.7, 0.5, 'Iris-setosa'],
 

In [17]:
type(df.values.tolist())

list

In [18]:
df.columns.values.tolist()

['Id',
 'SepalLengthCm',
 'SepalWidthCm',
 'PetalLengthCm',
 'PetalWidthCm',
 'Species']

In [19]:
df.describe()

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
| --- | --- | --- | --- | --- | --- |
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 75.5 | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 43.445368 | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 1.0 | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 38.25 | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 75.5 | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 112.75 | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 150.0 | 7.9 | 4.4 | 6.9 | 2.5 |


In [20]:
df.mean(numeric_only=True)  # skip the non-numeric Species column

Id               75.500000
SepalLengthCm     5.843333
SepalWidthCm      3.054000
PetalLengthCm     3.758667
PetalWidthCm      1.198667
dtype: float64

In [21]:
df.median(numeric_only=True)  # skip the non-numeric Species column

Id               75.50
SepalLengthCm     5.80
SepalWidthCm      3.00
PetalLengthCm     4.35
PetalWidthCm      1.30
dtype: float64

In [22]:
df.min()

Id                         1
SepalLengthCm            4.3
SepalWidthCm               2
PetalLengthCm              1
PetalWidthCm             0.1
Species          Iris-setosa
dtype: object

In [23]:
df.max()

Id                          150
SepalLengthCm               7.9
SepalWidthCm                4.4
PetalLengthCm               6.9
PetalWidthCm                2.5
Species          Iris-virginica
dtype: object

Here is a link that shows other mathematical functions that can be applied to DataFrames:

[https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)
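Reductions such as mean and median are only defined for numeric data, while the Iris DataFrame also holds the text column Species. Depending on the pandas version, calling mean on such a DataFrame may either skip the text column or raise an error, so it can help to be explicit. The sketch below uses a small hypothetical DataFrame (df_small is not part of the notebook) to show two ways of restricting a reduction to the numeric columns.

```python
import pandas as pd

# Hypothetical frame with one numeric column and one text column
df_small = pd.DataFrame({
    'PetalLengthCm': [1.4, 4.7, 6.0],
    'Species': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
})

# Restrict the reduction to numeric columns explicitly
means = df_small.mean(numeric_only=True)
print(means)

# Alternative: select the numeric columns first, then reduce
means_alt = df_small.select_dtypes(include='number').mean()
print(means_alt)
```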

## Applying Series Properties and Methods to DataFrame Columns

The columns of a DataFrame are Series objects, so you can apply the Series properties and methods that we have discussed previously.

In [24]:
print(df['PetalLengthCm'].sum())
print(df['PetalLengthCm'].mean())
print(df['PetalLengthCm'].median())
print(df['PetalLengthCm'].shape)
print(df['PetalLengthCm'].describe())

563.8000000000004
3.7586666666666693
4.35
(150,)
count    150.000000
mean       3.758667
std        1.764420
min        1.000000
25%        1.600000
50%        4.350000
75%        5.100000
max        6.900000
Name: PetalLengthCm, dtype: float64


In [25]:
df['PetalLengthCm'].head()

0    1.4
1    1.4
2    1.3
3    1.5
4    1.4
Name: PetalLengthCm, dtype: float64

In [26]:
df['PetalLengthCm'].values

array([ 1.4,  1.4,  1.3,  1.5,  1.4,  1.7,  1.4,  1.5,  1.4,  1.5,  1.5,
        1.6,  1.4,  1.1,  1.2,  1.5,  1.3,  1.4,  1.7,  1.5,  1.7,  1.5,
        1. ,  1.7,  1.9,  1.6,  1.6,  1.5,  1.4,  1.6,  1.6,  1.5,  1.5,
        1.4,  1.5,  1.2,  1.3,  1.5,  1.3,  1.5,  1.3,  1.3,  1.3,  1.6,
        1.9,  1.4,  1.6,  1.4,  1.5,  1.4,  4.7,  4.5,  4.9,  4. ,  4.6,
        4.5,  4.7,  3.3,  4.6,  3.9,  3.5,  4.2,  4. ,  4.7,  3.6,  4.4,
        4.5,  4.1,  4.5,  3.9,  4.8,  4. ,  4.9,  4.7,  4.3,  4.4,  4.8,
        5. ,  4.5,  3.5,  3.8,  3.7,  3.9,  5.1,  4.5,  4.5,  4.7,  4.4,
        4.1,  4. ,  4.4,  4.6,  4. ,  3.3,  4.2,  4.2,  4.2,  4.3,  3. ,
        4.1,  6. ,  5.1,  5.9,  5.6,  5.8,  6.6,  4.5,  6.3,  5.8,  6.1,
        5.1,  5.3,  5.5,  5. ,  5.1,  5.3,  5.5,  6.7,  6.9,  5. ,  5.7,
        4.9,  6.7,  4.9,  5.7,  6. ,  4.8,  4.9,  5.6,  5.8,  6.1,  6.4,
        5.6,  5.1,  5.6,  6.1,  5.6,  5.5,  4.8,  5.4,  5.6,  5.1,  5.1,
        5.9,  5.7,  5.2,  5. ,  5.2,  5.4,  5.1])

In [27]:
df['PetalLengthCm'].index

RangeIndex(start=0, stop=150, step=1)

In [28]:
df['PetalLengthCm'].size

150

In [29]:
df['PetalLengthCm'].name

'PetalLengthCm'

In [30]:
df['Species'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [31]:
df['Species'].value_counts()

Iris-virginica     50
Iris-setosa        50
Iris-versicolor    50
Name: Species, dtype: int64

## Sorting DataFrame Rows

In [32]:
df.sort_values(by='SepalLengthCm',inplace=True)
df

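| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
| --- | --- | --- | --- | --- | --- | --- |
| 13 | 14 | 4.3 | 3.0 | 1.1 | 0.1 | Iris-setosa |
| 42 | 43 | 4.4 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 38 | 39 | 4.4 | 3.0 | 1.3 | 0.2 | Iris-setosa |
| 8 | 9 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 41 | 42 | 4.5 | 2.3 | 1.3 | 0.3 | Iris-setosa |
| 22 | 23 | 4.6 | 3.6 | 1.0 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 47 | 48 | 4.6 | 3.2 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |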


In [33]:
df.sort_index(inplace=True)
df

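| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 6 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 8 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 9 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 10 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |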


In [34]:
df.sort_values(by=['SepalLengthCm','SepalWidthCm'], inplace=True)
df

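| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
| --- | --- | --- | --- | --- | --- | --- |
| 13 | 14 | 4.3 | 3.0 | 1.1 | 0.1 | Iris-setosa |
| 8 | 9 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 38 | 39 | 4.4 | 3.0 | 1.3 | 0.2 | Iris-setosa |
| 42 | 43 | 4.4 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 41 | 42 | 4.5 | 2.3 | 1.3 | 0.3 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 47 | 48 | 4.6 | 3.2 | 1.4 | 0.2 | Iris-setosa |
| 6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 22 | 23 | 4.6 | 3.6 | 1.0 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |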


In [35]:
df.sort_values(by='Id',inplace=True)
df

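| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 6 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 8 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 9 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 10 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |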
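The cells above sort df in place. As a complement, here is a minimal sketch of sorting without inplace=True, with a separate sort direction for each column. The four-row DataFrame df_small and the other variable names are hypothetical and serve only as illustration.

```python
import pandas as pd

# Hypothetical four-row frame with a tie in SepalLengthCm
df_small = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'SepalLengthCm': [5.1, 4.9, 4.7, 4.9],
    'SepalWidthCm': [3.5, 3.0, 3.2, 3.1],
})

# Without inplace=True, sort_values returns a new DataFrame
# and leaves df_small unchanged
df_sorted = df_small.sort_values(by=['SepalLengthCm', 'SepalWidthCm'],
                                 ascending=[True, False])
print(df_sorted)

# The original index labels travel with the rows;
# reset_index(drop=True) renumbers them from 0
df_renumbered = df_sorted.reset_index(drop=True)
print(df_renumbered)
```

Returning a new DataFrame keeps the original row order available, which avoids the need to restore it afterwards with sort_index.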


## Creating New DataFrame Columns

Feature engineering is the practice of creating new columns that enhance the value of your analysis by, for example, increasing the accuracy of predictions. We will demonstrate how to create new columns in a pandas DataFrame, although the task we will perform is not of monumental value. Specifically, we will create a new column that converts a measurement from centimeters to inches.

Let's convert the PetalLengthCm column to inches and, in so doing, create a new column called PetalLengthIn.

The statement in the cell below is, perhaps, a bit confusing. On the right-hand side we appear to divide a DataFrame column by a constant, which seems to mix data types: a pandas Series and a scalar. However, the division is applied element-wise, so each element of the Series is divided by 2.54.

In [36]:
CmPerIn = 2.54   # Conversion factor: centimeters per inch
df['PetalLengthIn'] = df['PetalLengthCm'] / CmPerIn
df['PetalLengthIn']

0      0.551181
1      0.551181
2      0.511811
3      0.590551
4      0.551181
5      0.669291
6      0.551181
7      0.590551
8      0.551181
9      0.590551
10     0.590551
11     0.629921
12     0.551181
13     0.433071
14     0.472441
15     0.590551
16     0.511811
17     0.551181
18     0.669291
19     0.590551
20     0.669291
21     0.590551
22     0.393701
23     0.669291
24     0.748031
25     0.629921
26     0.629921
27     0.590551
28     0.551181
29     0.629921
         ...   
120    2.244094
121    1.929134
122    2.637795
123    1.929134
124    2.244094
125    2.362205
126    1.889764
127    1.929134
128    2.204724
129    2.283465
130    2.401575
131    2.519685
132    2.204724
133    2.007874
134    2.204724
135    2.401575
136    2.204724
137    2.165354
138    1.889764
139    2.125984
140    2.204724
141    2.007874
142    2.007874
143    2.322835
144    2.244094
145    2.047244
146    1.968504
147    2.047244
148    2.125984
149    2.007874
Name: PetalLengthIn, Length: 150, dtype: float64
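The element-wise behavior can be checked on a small scale. The sketch below divides a hypothetical three-element Series by the conversion factor and compares the result with an explicit loop over the elements; the names petal_cm, petal_in, and petal_in_loop are made up for this example.

```python
import pandas as pd

CM_PER_INCH = 2.54  # centimeters per inch

# Hypothetical petal lengths in centimeters
petal_cm = pd.Series([1.4, 4.7, 6.0], name='PetalLengthCm')

# The scalar is applied to every element of the Series
petal_in = petal_cm / CM_PER_INCH

# The same computation written as an explicit loop over the elements
petal_in_loop = pd.Series([x / CM_PER_INCH for x in petal_cm])

print(petal_in.round(6).tolist())
print(petal_in_loop.round(6).tolist())
```

Both approaches give the same values; the first is shorter and avoids the Python-level loop.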

Recall that we previously saw that we might predict the Iris species from the petal length, which is now available in inches in the PetalLengthIn column we just created. A PetalLengthIn of at most 1 inch suggests Iris-setosa, a PetalLengthIn above 1 and up to 1.9 inches suggests Iris-versicolor, and a PetalLengthIn greater than 1.9 inches suggests Iris-virginica.

Let's create a column with these predictions. The first step is to define a function that, given a row from the DataFrame df, examines the value in the PetalLengthIn column and returns the appropriate species name as a string.

In [39]:
def predict(row):
    if row['PetalLengthIn'] <= 1.0:
        return 'Iris-setosa'
    elif row['PetalLengthIn'] <= 1.9:
        return 'Iris-versicolor'
    else:
        return 'Iris-virginica'

The DataFrame .apply method, as used in the next cell, works as follows:

- Each row of the DataFrame is passed, one by one, to the function given as the first argument. In this case that function is predict, which we just defined.
- The return values are collected into a Series with one entry per row. Assigning that Series to the new column 'Predict' stores each prediction in the same row that gave rise to it.
- The 'axis' argument determines whether the DataFrame data is sent to 'predict' by rows or by columns.

This last point is, perhaps, a bit confusing. Specifying axis = 'columns' means that each call receives one value from every column, that is, one complete row, so the data are sent row by row. Specifying axis = 'rows' (equivalently axis = 'index', which is the default) does the converse: the full contents of each column are sent to the function, one column at a time. The index is not sent as a separate column; it labels the values in each Series that the function receives.
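The effect of the axis argument is easiest to see on a tiny DataFrame. In this sketch the DataFrame df_small and its columns 'a' and 'b' are hypothetical; the first call receives one row at a time and the second receives one column at a time.

```python
import pandas as pd

# Hypothetical frame with two numeric columns
df_small = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis='columns' (or 1): the function receives one row at a time
row_sums = df_small.apply(lambda row: row['a'] + row['b'], axis='columns')
print(row_sums.tolist())

# axis='index' (or 0, the default): the function receives one column at a time
col_sums = df_small.apply(lambda col: col.sum(), axis='index')
print(col_sums.to_dict())
```

The first result has one entry per row, while the second has one entry per column.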

In [40]:
df['Predict'] = df.apply(predict,axis='columns')
print(df['Predict'])

0          Iris-setosa
1          Iris-setosa
2          Iris-setosa
3          Iris-setosa
4          Iris-setosa
5          Iris-setosa
6          Iris-setosa
7          Iris-setosa
8          Iris-setosa
9          Iris-setosa
10         Iris-setosa
11         Iris-setosa
12         Iris-setosa
13         Iris-setosa
14         Iris-setosa
15         Iris-setosa
16         Iris-setosa
17         Iris-setosa
18         Iris-setosa
19         Iris-setosa
20         Iris-setosa
21         Iris-setosa
22         Iris-setosa
23         Iris-setosa
24         Iris-setosa
25         Iris-setosa
26         Iris-setosa
27         Iris-setosa
28         Iris-setosa
29         Iris-setosa
            ...       
120     Iris-virginica
121     Iris-virginica
122     Iris-virginica
123     Iris-virginica
124     Iris-virginica
125     Iris-virginica
126    Iris-versicolor
127     Iris-virginica
128     Iris-virginica
129     Iris-virginica
130     Iris-virginica
131     Iris-virginica
132     Iris-virginica
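Calling a Python function once per row is easy to read but can be slow on large DataFrames. As an alternative technique (not the approach used in this notebook), the same thresholds can be applied to the whole column at once with NumPy's np.select. The sample values and the names petal_in, conditions, choices, and predicted are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical petal lengths in inches, including the two boundary values
petal_in = pd.Series([0.55, 1.0, 1.5, 1.9, 2.2], name='PetalLengthIn')

# Conditions are checked in order, mirroring the if / elif / else chain
conditions = [petal_in <= 1.0, petal_in <= 1.9]
choices = ['Iris-setosa', 'Iris-versicolor']

predicted = pd.Series(
    np.select(conditions, choices, default='Iris-virginica'),
    index=petal_in.index,
    name='Predict',
)
print(predicted.tolist())
```

Because np.select takes the first condition that holds for each element, the order of the conditions plays the same role as the order of the branches in predict.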

## Handling Null Fields

In [41]:
import numpy as np
# Two extra rows with missing values (NaN) for Species and Predict
dfAdd = pd.DataFrame([[151,5.0,17,1.4,0.2,np.nan,0.57,np.nan],
                      [152,5.0,18,1.4,0.2,np.nan,0.57,np.nan]],
                     columns=['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm',
                              'PetalWidthCm', 'Species', 'PetalLengthIn', 'Predict'])

In [42]:
dfAdd

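| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | PetalLengthIn | Predict |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 151 | 5.0 | 17 | 1.4 | 0.2 | NaN | 0.57 | NaN |
| 1 | 152 | 5.0 | 18 | 1.4 | 0.2 | NaN | 0.57 | NaN |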


In [43]:
df = pd.concat([df, dfAdd])  # DataFrame.append was removed in pandas 2.0

In [44]:
df

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | PetalLengthIn | Predict |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | 0.511811 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | 0.590551 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 5 | 6 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa | 0.669291 | Iris-setosa |
| 6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa | 0.551181 | Iris-setosa |
| 7 | 8 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa | 0.590551 | Iris-setosa |
| 8 | 9 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 9 | 10 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa | 0.590551 | Iris-setosa |


In [45]:
df.dropna(inplace=True)
df

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | PetalLengthIn | Predict |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | 0.511811 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | 0.590551 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 5 | 6 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa | 0.669291 | Iris-setosa |
| 6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa | 0.551181 | Iris-setosa |
| 7 | 8 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa | 0.590551 | Iris-setosa |
| 8 | 9 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 9 | 10 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa | 0.590551 | Iris-setosa |


In [46]:
dfAdd = pd.DataFrame([[151,5.0,17,1.4,0.2,np.nan,0.57,np.nan],
                      [152,5.0,18,1.4,0.2,np.nan,0.57,np.nan]],
                     columns=['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm',
                              'PetalWidthCm', 'Species', 'PetalLengthIn', 'Predict'])
df = pd.concat([df, dfAdd])  # DataFrame.append was removed in pandas 2.0
df

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | PetalLengthIn | Predict |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | 0.511811 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | 0.590551 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 5 | 6 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa | 0.669291 | Iris-setosa |
| 6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa | 0.551181 | Iris-setosa |
| 7 | 8 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa | 0.590551 | Iris-setosa |
| 8 | 9 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 9 | 10 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa | 0.590551 | Iris-setosa |


In [47]:
df['Species'] = df['Species'].fillna('NotSpecified')  # assign back instead of inplace on a column selection
df

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | PetalLengthIn | Predict |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | 0.511811 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | 0.590551 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 5 | 6 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa | 0.669291 | Iris-setosa |
| 6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa | 0.551181 | Iris-setosa |
| 7 | 8 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa | 0.590551 | Iris-setosa |
| 8 | 9 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 9 | 10 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa | 0.590551 | Iris-setosa |


In [48]:
df.dropna(subset = ['Predict'],inplace=True)
df

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | PetalLengthIn | Predict |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | 0.511811 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | 0.590551 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 5 | 6 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa | 0.669291 | Iris-setosa |
| 6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa | 0.551181 | Iris-setosa |
| 7 | 8 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa | 0.590551 | Iris-setosa |
| 8 | 9 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa | 0.551181 | Iris-setosa |
| 9 | 10 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa | 0.590551 | Iris-setosa |
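Since the displayed output shows only the first rows, the effect of each null-handling call is hard to see there. The sketch below repeats the same operations on a hypothetical three-row DataFrame (df_small and the derived names are not part of the notebook) and counts the missing values with isna().

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in two columns
df_small = pd.DataFrame({
    'Id': [1, 2, 3],
    'Species': ['Iris-setosa', np.nan, 'Iris-virginica'],
    'Predict': ['Iris-setosa', np.nan, np.nan],
})

# Count the missing values in each column
null_counts = df_small.isna().sum().to_dict()
print(null_counts)

# dropna() removes every row that has at least one missing value
df_complete = df_small.dropna()

# subset= limits the check to the listed columns
df_has_species = df_small.dropna(subset=['Species'])

# fillna() replaces missing values instead of dropping rows;
# a dict restricts the replacement to the named columns
df_filled = df_small.fillna({'Species': 'NotSpecified'})

print(len(df_complete), len(df_has_species), len(df_filled))
```

None of these calls modifies df_small, because each returns a new DataFrame when inplace=True is not given.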
