# pandas DataFrames

This notebook demonstrates the basics of pandas DataFrames. A DataFrame is a two-dimensional object whose columns are pandas Series, so all of the Series properties and methods can be applied to DataFrame columns.

In [None]:
import pandas as pd

## Our Data
We will use some data about these three species of Iris in the remainder of this Jupyter notebook.
![title](AllIrisSpecies.jpg)

## Inputting Text Files with pandas

The pd.read_csv() function reads a CSV file into a DataFrame.  By default, it takes the column names from the first row of the file.

In the cell below we read the contents of a file that contains the Iris data we previously worked with and assign that data to the variable df.  Naming a variable df, or giving it a name that starts with df, is a helpful reminder that the variable holds a DataFrame.

In [None]:
df = pd.read_csv('Iris.csv')
print('Type of df variable: ',type(df))
print('Column Labels: ',df.columns.values)
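The header behavior can be checked without the Iris file. The sketch below feeds a small in-memory CSV to pd.read_csv(); the three data rows and the name df_demo are made up for illustration and are not taken from Iris.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file: the first line holds the column names.
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,7.0,3.2,4.7,1.4,Iris-versicolor
3,6.3,3.3,6.0,2.5,Iris-virginica
"""

# read_csv accepts a path or a file-like object. With the default settings
# it takes the column labels from the first line of the input.
df_demo = pd.read_csv(io.StringIO(csv_text))

print(type(df_demo))
print(df_demo.columns.tolist())
print(df_demo.shape)
```

The shape is (rows, columns), so the header line is not counted as a data row.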

## Accessing DataFrame Elements

In [None]:
df['PetalLengthCm']

In [None]:
df['PetalWidthCm']

In [None]:
df[['PetalLengthCm','SepalLengthCm']]

In [None]:
type(df['PetalLengthCm'])

In [None]:
df['PetalLengthCm'].loc[0]

In [None]:
df['PetalLengthCm'].iloc[0]

In [None]:
df.loc[0]['PetalLengthCm']

In [None]:
df.iloc[0]['PetalLengthCm']

In [None]:
df.loc[0]

In [None]:
df.iloc[0]
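With the default index, .loc[0] and .iloc[0] return the same row, which can hide the difference between them: .loc selects by index label, while .iloc selects by integer position. The sketch below uses a made-up three-row DataFrame (df_demo, with index labels 10, 20, 30 chosen for illustration) so that labels and positions differ.

```python
import pandas as pd

# Made-up measurements; the index labels are deliberately not 0, 1, 2.
df_demo = pd.DataFrame(
    {
        "PetalLengthCm": [1.4, 4.7, 6.0],
        "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    },
    index=[10, 20, 30],
)

# .loc uses labels: the row labeled 10, column labeled 'PetalLengthCm'.
by_label = df_demo.loc[10, "PetalLengthCm"]

# .iloc uses positions: the first row, whatever its label is.
by_position = df_demo.iloc[0]["PetalLengthCm"]

# A single row comes back as a Series indexed by the column labels.
first_row = df_demo.iloc[0]

print(by_label, by_position)
print(type(first_row))
```

Here df_demo.loc[0] would raise a KeyError, because no row carries the label 0.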

## DataFrame Properties and Methods

Many of the same properties that we applied to pandas Series also apply to pandas DataFrames.

In [None]:
df.shape

In [None]:
type(df.shape)

In [None]:
df.dtypes

In [None]:
df.values

In [None]:
df.index

In [None]:
df.columns

In [None]:
df.columns.values

In [None]:
df.values.tolist()

In [None]:
type(df.values.tolist())

In [None]:
df.columns.values.tolist()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
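# numeric_only=True skips the text column 'Species'
df.mean(numeric_only=True)

In [None]:
# numeric_only=True skips the text column 'Species'
df.median(numeric_only=True)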

In [None]:
df.min()

In [None]:
df.max()
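In recent pandas versions (2.0 and later, as far as I am aware), mean and median raise an error when the DataFrame contains a text column such as Species, unless numeric_only=True is passed. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up data with two numeric columns and one text column.
df_demo = pd.DataFrame(
    {
        "PetalLengthCm": [1.4, 4.7, 6.0],
        "PetalWidthCm": [0.2, 1.4, 2.5],
        "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    }
)

# numeric_only=True restricts the calculation to the numeric columns,
# so 'Species' does not appear in the result.
means = df_demo.mean(numeric_only=True)
medians = df_demo.median(numeric_only=True)

print(means)
print(medians)
```

Each result is a Series indexed by the names of the numeric columns.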

Here is a link that shows other mathematical functions that can be applied to DataFrames:

[https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)

## Applying Series Properties and Methods to DataFrame Columns

The columns of a DataFrame are Series objects, so you can apply the Series properties and methods that we have discussed previously.

In [None]:
print(df['PetalLengthCm'].sum())
print(df['PetalLengthCm'].mean())
print(df['PetalLengthCm'].median())
print(df['PetalLengthCm'].shape)
print(df['PetalLengthCm'].describe())

In [None]:
df['PetalLengthCm'].head()

In [None]:
df['PetalLengthCm'].values

In [None]:
df['PetalLengthCm'].index

In [None]:
df['PetalLengthCm'].size

In [None]:
df['PetalLengthCm'].name

In [None]:
df['Species'].unique()

In [None]:
df['Species'].value_counts()
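The last two calls are worth a closer look: unique() returns the distinct values in order of first appearance, and value_counts() returns how often each value occurs, most frequent first. A small sketch on a made-up Series (the name species is an illustration, not part of the notebook):

```python
import pandas as pd

# Made-up labels standing in for the Species column.
species = pd.Series(
    [
        "Iris-setosa",
        "Iris-virginica",
        "Iris-setosa",
        "Iris-versicolor",
        "Iris-setosa",
        "Iris-virginica",
    ],
    name="Species",
)

distinct = species.unique()       # distinct values, in order of first appearance
counts = species.value_counts()   # occurrences per value, sorted descending

print(distinct)
print(counts)
```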

## Sorting DataFrame Rows

In [None]:
df.sort_values(by='SepalLengthCm',inplace=True)
df

In [None]:
df.sort_index(inplace=True)
df

In [None]:
df.sort_values(by=['PetalLengthCm'], inplace=True)
df

In [None]:
df.sort_values(by=['SepalLengthCm','SepalWidthCm'], inplace=True)
df

In [None]:
df.sort_values(by='Id',inplace=True)
df
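The cells above all sort in place. As a sketch of the alternative, sort_values without inplace=True returns a new DataFrame and leaves the original untouched, and the ascending argument can set the direction per column. The four rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows with a tie in SepalLengthCm (rows with Id 1 and 3).
df_demo = pd.DataFrame(
    {
        "Id": [1, 2, 3, 4],
        "SepalLengthCm": [5.1, 4.9, 5.1, 4.7],
        "SepalWidthCm": [3.5, 3.0, 3.3, 3.2],
    }
)

# Sort by SepalLengthCm descending; break ties by SepalWidthCm ascending.
# No inplace=True, so df_demo itself keeps its original row order.
df_sorted = df_demo.sort_values(
    by=["SepalLengthCm", "SepalWidthCm"], ascending=[False, True]
)

# The original index labels travel with the rows, so sort_index
# restores the original order.
df_restored = df_sorted.sort_index()

print(df_sorted)
print(df_restored)
```

Whether to sort in place or keep both versions is a matter of taste; returning a new DataFrame makes it harder to lose the original order by accident.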

## Creating New DataFrame Columns

Feature engineering is the name for creating new columns that enhance the value of your analysis by, for example, increasing the accuracy of predictions.  We will demonstrate how to create new columns in a pandas DataFrame, although the task we will perform is not of monumental value.  Specifically, we will create a new column that converts the data from being measured in centimeters to being measured in inches.

Let's convert the PetalLengthCm column to inches and, in so doing, create a new column called PetalLengthIn.

The statement in the cell below is, perhaps, a bit confusing.  On the right-hand side we divide a DataFrame column by a constant, so it seems we have mixed data types: a pandas Series and a single number.  However, the division is applied element-wise, so each element of the column is divided by 2.54 and the result is a new Series of the same length.

In [None]:
CmPerIn = 2.54   # Centimeters per inch
df['PetalLengthIn'] = df['PetalLengthCm'] / CmPerIn
df['PetalLengthIn']
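To make the element-wise behavior concrete, the sketch below divides a made-up Series by the conversion factor and compares the result with an explicit loop over the elements. The names CM_PER_INCH, petal_cm, and petal_in are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd

CM_PER_INCH = 2.54  # one inch is 2.54 centimeters

# Made-up lengths in centimeters.
petal_cm = pd.Series([2.54, 5.08, 1.27], name="PetalLengthCm")

# Series / scalar: the scalar is applied to every element.
petal_in = petal_cm / CM_PER_INCH

# The same calculation written as an explicit loop, for comparison.
petal_in_loop = [x / CM_PER_INCH for x in petal_cm]

print(petal_in.tolist())
print(petal_in_loop)
```

The vectorized form is shorter and, on large columns, usually much faster than the loop.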

Sort the rows again, this time by PetalLengthIn.

In [None]:
df.sort_values(by=['PetalLengthIn'], inplace=True)
pd.set_option('display.max_rows', 200)
df

We can observe in the column we just created that a PetalLengthIn of less than 1 inch is most often associated with Iris-setosa, a PetalLengthIn between 1 and 1.9 inches suggests Iris-versicolor, and a PetalLengthIn greater than 1.9 inches suggests Iris-virginica.

Let's create a column with these predictions.  The first step is to create a function that, given a row from the DataFrame df, evaluates the value in the PetalLengthIn column and returns the appropriate species name as a string.

In [None]:
def predict(row):
    if row['PetalLengthIn'] <= 1.0:
        return 'Iris-setosa'
    elif row['PetalLengthIn'] <= 1.9:
        return 'Iris-versicolor'
    else:
        return 'Iris-virginica'

The DataFrame .apply method causes these actions:

- Each row of the DataFrame is passed, one by one, to the function given as the first argument.  In this case the function is predict, which we just defined.
- The return values from predict are collected into a Series, one value per row, which we assign to the new column 'Predict'.  Each value lands in the same row that gave rise to it.
- The 'axis' argument determines whether the DataFrame data is sent to 'predict' by rows or by columns.

This last point is, perhaps, a bit confusing.  Specifying axis = 'columns' (equivalently, axis = 1) means that each call receives one value from every column, which is to say that the data are sent row by row.  Specifying axis = 'index' (equivalently, axis = 0, the default) does the converse: the full contents of each column are sent, one column at a time, to the function.  The index is not sent as a column of its own; it labels the values in each Series that the function receives.

In [None]:
df['Predict'] = df.apply(predict,axis='columns')
print(df['Predict'])
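The effect of the axis argument is easier to see on a tiny example. The sketch below uses a made-up two-column DataFrame and two throwaway lambda functions; none of these names or values come from the Iris data.

```python
import pandas as pd

# Made-up measurements in inches.
df_demo = pd.DataFrame(
    {
        "PetalLengthIn": [0.55, 1.85, 2.36],
        "PetalWidthIn": [0.08, 0.55, 0.98],
    }
)

# axis='columns' (same as axis=1): the function receives one ROW at a time,
# as a Series indexed by the column labels. One result per row.
row_sums = df_demo.apply(
    lambda row: row["PetalLengthIn"] + row["PetalWidthIn"], axis="columns"
)

# axis='index' (same as axis=0): the function receives one COLUMN at a time,
# as a Series indexed by the row labels. One result per column.
col_max = df_demo.apply(lambda col: col.max(), axis="index")

print(row_sums)
print(col_max)
```

A useful way to remember it: the result of axis='columns' has one entry per row, and the result of axis='index' has one entry per column.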

## Handling Null Fields

In [None]:
import numpy as np
# Two extra rows whose Species and Predict values are missing (np.nan)
dfAdd = pd.DataFrame([[151,5.0,17,1.4,0.2,np.nan,0.57,np.nan],[152,5.0,18,1.4,0.2,np.nan,0.57,np.nan]], columns=['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
 'Species', 'PetalLengthIn', 'Predict'])

In [None]:
dfAdd

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, dfAdd])

In [None]:
df

In [None]:
df.reset_index(inplace=True)
df

In [None]:
df.dropna(inplace=True)
df

In [None]:
# Add the two incomplete rows again so we can try other ways of handling them
dfAdd = pd.DataFrame([[151,5.0,17,1.4,0.2,np.nan,0.57,np.nan],[152,5.0,18,1.4,0.2,np.nan,0.57,np.nan]], columns=['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
 'Species', 'PetalLengthIn', 'Predict'])
df = pd.concat([df, dfAdd])
df

In [None]:
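# Assign the result back; calling fillna with inplace=True on a selected column may not update df
df['Species'] = df['Species'].fillna('NotSpecified')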
df

In [None]:
df.dropna(subset = ['Predict'],inplace=True)
df
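The cells in this section drop or fill missing values in three different ways. As a compact recap, the sketch below applies the same ideas to a made-up three-row DataFrame and adds isna().sum(), which counts the missing values per column before anything is changed. All values and the name df_demo are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing number and one missing label.
df_demo = pd.DataFrame(
    {
        "Id": [1, 2, 3],
        "PetalLengthCm": [1.4, 4.7, np.nan],
        "Species": ["Iris-setosa", None, "Iris-virginica"],
    }
)

# Count the missing values in each column.
null_counts = df_demo.isna().sum()

# Drop every row that has at least one missing value.
complete_rows = df_demo.dropna()

# Drop only the rows where 'Species' is missing.
with_species = df_demo.dropna(subset=["Species"])

# Replace missing 'Species' values with a placeholder instead of dropping.
filled = df_demo.assign(Species=df_demo["Species"].fillna("NotSpecified"))

print(null_counts)
print(complete_rows)
print(with_species)
print(filled)
```

None of these calls modifies df_demo, so the different strategies can be compared side by side.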