## __Pandas DataFrame__

A pandas DataFrame is a two-dimensional, tabular data structure with labeled rows and columns.

## Step 1: Import Pandas and Create an Empty DataFrame

- Import the pandas library to create a DataFrame:


In [1]:
import pandas as pd

To create a DataFrame, we call the DataFrame() constructor from the pandas library.

Let's create a DataFrame **df**.

In [2]:
df = pd.DataFrame()

Now, let's find the type of the DataFrame **df**.

In [3]:
type(df)

pandas.core.frame.DataFrame

**Observation**

The type of the DataFrame is **pandas.core.frame.DataFrame**.
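
As a quick check, a DataFrame created this way holds no rows and no columns. The sketch below uses the `empty` and `shape` attributes to confirm that; the variable name `empty_df` is only illustrative.

```python
import pandas as pd

# Create an empty DataFrame, as in the step above
empty_df = pd.DataFrame()

# An empty DataFrame has zero rows and zero columns
print(empty_df.empty)   # True
print(empty_df.shape)   # (0, 0)
```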

## Step 2: Read a CSV File

We can load the data into a DataFrame from various data files. Here, we will load the data from a CSV file.

- Read a CSV file using the pd.read_csv() function:


In [4]:
df = pd.read_csv('../../Datasets/PandasExample.csv')

Now, let's print the DataFrame **df** to see the data.

We can also print only the first or last five rows of a DataFrame using the head() and tail() methods, which are covered in Step 3.

In [5]:
df

           Name  Age  Gender
0        Nithin   24    Male
1         Manoj   30    Male
2  Shivashankar   44    Male
3        Swathi   18  Female
4   Pareekshith   28    Male


**Observation**

- The result is in a tabular format, which has rows and columns.

- The row indices (0 to 4) are generated automatically.

- There are 3 columns: **Name**, **Age** and **Gender**.
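
The observations above can also be checked in code. The following is a minimal sketch that does not need the file on disk: it assumes the CSV contents match the output shown above and reads them from an in-memory string. The names `csv_text` and `people` are illustrative.

```python
import io
import pandas as pd

# Assumed file contents, reconstructed from the output shown above
csv_text = """Name,Age,Gender
Nithin,24,Male
Manoj,30,Male
Shivashankar,44,Male
Swathi,18,Female
Pareekshith,28,Male
"""

# read_csv accepts a file path or any file-like object
people = pd.read_csv(io.StringIO(csv_text))

print(people.shape)          # (5, 3)
print(list(people.columns))  # ['Name', 'Age', 'Gender']
```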



## Step 3: Display the First and Last Rows

Display the first and last rows of the DataFrame using the head() and tail() methods:

- head() returns the first 5 rows of a DataFrame.

- tail() returns the last 5 rows of a DataFrame.

In [6]:

df.head()

           Name  Age  Gender
0        Nithin   24    Male
1         Manoj   30    Male
2  Shivashankar   44    Male
3        Swathi   18  Female
4   Pareekshith   28    Male


In [7]:
df.tail()

           Name  Age  Gender
0        Nithin   24    Male
1         Manoj   30    Male
2  Shivashankar   44    Male
3        Swathi   18  Female
4   Pareekshith   28    Male


We can also specify the number of rows to display by passing it to the head() or tail() method.

- Print the first 2 rows of the DataFrame **df**:


In [8]:
df.head(2)

     Name  Age Gender
0  Nithin   24   Male
1   Manoj   30   Male


Print the last two rows by passing 2 as an argument:

In [9]:
df.tail(2)

          Name  Age  Gender
3       Swathi   18  Female
4  Pareekshith   28    Male
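
Because the example file has exactly five rows, head() and tail() with no argument return the same rows. A minimal sketch with a hypothetical eight-row DataFrame (the name `numbers` and its contents are made up for illustration) makes the difference visible:

```python
import pandas as pd

# Hypothetical 8-row DataFrame, used only to show the difference
numbers = pd.DataFrame({'n': range(8)})

first_five = numbers.head()    # rows 0 to 4
last_five = numbers.tail()     # rows 3 to 7
first_two = numbers.head(2)    # rows 0 and 1

print(first_five['n'].tolist())  # [0, 1, 2, 3, 4]
print(last_five['n'].tolist())   # [3, 4, 5, 6, 7]
print(first_two['n'].tolist())   # [0, 1]
```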


## Step 4: Index-Based Accessing

We can access elements using the .iloc indexer, which takes integer positions. Positions run from 0 to n-1, where n is the total number of rows or columns.

- Access rows in a DataFrame using the .iloc indexer:


In [10]:
df.iloc[0]

Name      Nithin
Age           24
Gender      Male
Name: 0, dtype: object

**Observation**

The result is the data in the first row.

The result also shows the data type. The dtype is **object** because the row holds values of different types (strings and an integer).

Now, let's print the data in the 4<sup>th</sup> row.

In [11]:
df.iloc[3]

Name      Swathi
Age           18
Gender    Female
Name: 3, dtype: object

**Observation**

This is the data from the 4<sup>th</sup> row.
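
Beyond a single row, .iloc also accepts a row and column position together, slices, and negative positions. The sketch below rebuilds the example data inline (an assumption, so that it runs without the CSV file); the variable names are illustrative.

```python
import pandas as pd

# Example data rebuilt inline so the sketch runs on its own
people = pd.DataFrame({
    'Name': ['Nithin', 'Manoj', 'Shivashankar', 'Swathi', 'Pareekshith'],
    'Age': [24, 30, 44, 18, 28],
    'Gender': ['Male', 'Male', 'Male', 'Female', 'Male'],
})

single_value = people.iloc[3, 0]     # row position 3, column position 0
first_two_rows = people.iloc[0:2]    # the end of a slice is exclusive
last_row = people.iloc[-1]           # negative positions count from the end

print(single_value)                  # Swathi
print(len(first_two_rows))           # 2
print(last_row['Name'])              # Pareekshith
```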

## Step 5: Access DataFrame Values as a NumPy Array

We can also retrieve just the values from a DataFrame, without the row and column labels.

- Access the DataFrame values as a NumPy array using the .values attribute:


In [12]:
df.values

array([['Nithin', 24, 'Male'],
       ['Manoj', 30, 'Male'],
       ['Shivashankar', 44, 'Male'],
       ['Swathi', 18, 'Female'],
       ['Pareekshith', 28, 'Male']], dtype=object)
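
The pandas documentation recommends the `to_numpy()` method over `.values` for new code. A minimal sketch on a small inline DataFrame (the data and variable names are illustrative) shows how the resulting dtype depends on the columns involved:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({'Name': ['Nithin', 'Manoj'], 'Age': [24, 30]})

# Mixed string and integer columns produce an object array
all_values = people.to_numpy()

# A single integer column keeps an integer dtype
ages = people['Age'].to_numpy()

print(all_values.dtype)   # object
print(ages.tolist())      # [24, 30]
```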

## Step 6: Read a CSV File in Chunks

We can also read the file in chunks.

With chunksize set, pd.read_csv() returns an iterable rather than a DataFrame. We need to iterate through it to get the data.

To read the CSV file in chunks, we pass the **chunksize** parameter (the number of rows per chunk) to the pd.read_csv() function.


- Read a CSV file in chunks:

In [13]:
df = pd.read_csv('../../Datasets/PandasExample.csv',chunksize=2)

Now, let's print each chunk.

In [14]:
for chunk in df:
  print(chunk)

     Name  Age Gender
0  Nithin   24   Male
1   Manoj   30   Male
           Name  Age  Gender
2  Shivashankar   44    Male
3        Swathi   18  Female
          Name  Age Gender
4  Pareekshith   28   Male


**Observation**

We can see that three separate DataFrames are created: with a chunk size of 2, the five rows are split into chunks of 2, 2, and 1 rows.
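
Chunked reading is mainly useful when each chunk is processed and then discarded. The sketch below totals the **Age** column one chunk at a time. It reads from an in-memory string whose contents are assumed from the output above, so the names `csv_text`, `total_age`, and `chunk_sizes` are illustrative.

```python
import io
import pandas as pd

# Assumed data, held in memory so the sketch runs without the CSV file
csv_text = "Name,Age\nNithin,24\nManoj,30\nShivashankar,44\nSwathi,18\nPareekshith,28\n"

total_age = 0
chunk_sizes = []

# Each chunk is an ordinary DataFrame with at most 2 rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunk_sizes.append(len(chunk))
    total_age += chunk['Age'].sum()

print(chunk_sizes)   # [2, 2, 1]
print(total_age)     # 144
```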

## Step 7: Filter the DataFrame Based on a Condition

We can also filter the data using conditions in pandas. For example, we can keep only the rows where the age is above 25.

- Read PandasExample.csv into DataFrame **df**
- Access the **Age** column from **df**, and check df['Age'] > 25, which gives a Boolean value for each row
- Extract only the rows from **df** where that condition is True
- Assign the result back to **df**


Let's read the CSV file.

In [15]:
df = pd.read_csv('../../Datasets/PandasExample.csv')

Now, let's see how to extract the data where **Age** is greater than 25.

In [16]:
df = df[df['Age']>25]

Print **df**:

In [17]:
df

           Name  Age Gender
1         Manoj   30   Male
2  Shivashankar   44   Male
4   Pareekshith   28   Male


**Observation**

The DataFrame now shows records where **Age** is greater than 25.
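
Conditions can be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses. The sketch below rebuilds the example data inline (an assumption, so that it runs without the CSV file). It also shows that filtered rows keep their original index labels.

```python
import pandas as pd

# Example data rebuilt inline so the sketch runs on its own
people = pd.DataFrame({
    'Name': ['Nithin', 'Manoj', 'Shivashankar', 'Swathi', 'Pareekshith'],
    'Age': [24, 30, 44, 18, 28],
    'Gender': ['Male', 'Male', 'Male', 'Female', 'Male'],
})

# Both conditions must hold
older_men = people[(people['Age'] > 25) & (people['Gender'] == 'Male')]

# At least one condition must hold
young_or_female = people[(people['Age'] < 20) | (people['Gender'] == 'Female')]

print(older_men.index.tolist())           # [1, 2, 4]
print(young_or_female['Name'].tolist())   # ['Swathi']
```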

**DataFrame**

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and a column index; it can be thought of as a dict of Series, all sharing the same index.

One way to create a DataFrame is from a dict of equal-length lists or NumPy arrays:

In [18]:
data = {'eucountry': ['france', 'germany', 'austria', 'sweden', 'Norway'],
'year': [2000, 2001, 2002, 2001, 2002],
'popul': [1.5, 1.7, 3.6, 2.4, 2.9]}

In [19]:
frame = pd.DataFrame(data)

In [20]:
frame

  eucountry  year  popul
0    france  2000    1.5
1   germany  2001    1.7
2   austria  2002    3.6
3    sweden  2001    2.4
4    Norway  2002    2.9


In [21]:
#the index is assigned automatically, as with Series, and the columns follow the order of the keys in the dict
print(frame)

  eucountry  year  popul
0    france  2000    1.5
1   germany  2001    1.7
2   austria  2002    3.6
3    sweden  2001    2.4
4    Norway  2002    2.9


In [22]:
#if we specify a sequence of columns, the DataFrame’s columns will be exactly what we pass
print(pd.DataFrame(data, columns=['year', 'eucountry', 'popul']))
framenew = pd.DataFrame(data, columns=['year', 'eucountry', 'popul'])

   year eucountry  popul
0  2000    france    1.5
1  2001   germany    1.7
2  2002   austria    3.6
3  2001    sweden    2.4
4  2002    Norway    2.9


In [23]:
framenew

   year eucountry  popul
0  2000    france    1.5
1  2001   germany    1.7
2  2002   austria    3.6
3  2001    sweden    2.4
4  2002    Norway    2.9


In [24]:
#As with Series, if you pass a column that isn’t contained in data, it will appear with NA values in the result
frame2 = pd.DataFrame(data, columns=['year', 'eucountry', 'popul', 'debt'],
index=['one', 'two', 'three', 'four', 'five'])

In [25]:
frame2.head()

       year eucountry  popul debt
one    2000    france    1.5  NaN
two    2001   germany    1.7  NaN
three  2002   austria    3.6  NaN
four   2001    sweden    2.4  NaN
five   2002    Norway    2.9  NaN


In [26]:
#A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:
print(frame2['eucountry'])


one       france
two      germany
three    austria
four      sweden
five      Norway
Name: eucountry, dtype: object


In [27]:
print(frame2.year)

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64


In [30]:
#Rows can be retrieved by label with .loc or by position with .iloc.
#The older .ix indexer is deprecated (and was removed in pandas 1.0), so the call below is left commented out:
#print(frame2.ix['three'])

In [32]:
print(frame2.loc['three'])
#print(frame2.loc[['three','one']])
#print(frame2.iloc[3])

year            2002
eucountry    austria
popul            3.6
debt             NaN
Name: three, dtype: object
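
To make the difference between label-based and position-based access concrete, here is a minimal sketch on a small inline DataFrame (the name `labelled` and its contents are illustrative). It shows `.loc` and `.iloc` returning the same row, and `.loc` accepting a list of labels.

```python
import pandas as pd

labelled = pd.DataFrame(
    {'year': [2000, 2001, 2002], 'popul': [1.5, 1.7, 3.6]},
    index=['one', 'two', 'three'],
)

by_label = labelled.loc['three']          # select by index label
by_position = labelled.iloc[2]            # select by integer position
several = labelled.loc[['three', 'one']]  # a list of labels returns a DataFrame

print(by_label.equals(by_position))       # True
print(several.index.tolist())             # ['three', 'one']
```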


In [33]:
#Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar 
#value or an array of values
frame2['debt'] = 16.5

In [34]:
print(frame2)

       year eucountry  popul  debt
one    2000    france    1.5  16.5
two    2001   germany    1.7  16.5
three  2002   austria    3.6  16.5
four   2001    sweden    2.4  16.5
five   2002    Norway    2.9  16.5


In [35]:
frame2['debt'] == 100

one      False
two      False
three    False
four     False
five     False
Name: debt, dtype: bool

In [36]:
frame2['debt'] != 100

one      True
two      True
three    True
four     True
five     True
Name: debt, dtype: bool

In [38]:
import numpy as np
frame2['debt'] = np.arange(5.)
print(frame2)

       year eucountry  popul  debt
one    2000    france    1.5   0.0
two    2001   germany    1.7   1.0
three  2002   austria    3.6   2.0
four   2001    sweden    2.4   3.0
five   2002    Norway    2.9   4.0


In [39]:
#When assigning lists or arrays to a column, the value’s length must match the length
#of the DataFrame. If you assign a Series, it is instead aligned to the
#DataFrame’s index, and missing values are inserted for any labels the Series lacks:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val

In [40]:
frame2

       year eucountry  popul  debt
one    2000    france    1.5   NaN
two    2001   germany    1.7  -1.2
three  2002   austria    3.6   NaN
four   2001    sweden    2.4  -1.5
five   2002    Norway    2.9  -1.7
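
The two assignment rules can be seen side by side in a small sketch. The DataFrame and the label `'ten'` are made up for illustration. The Series is aligned by label, so a label that is not in the DataFrame's index is ignored, while a list of the wrong length raises a `ValueError`.

```python
import pandas as pd

frame = pd.DataFrame({'popul': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# A Series is aligned on index labels; 'ten' is not in the frame, so it is ignored
frame['debt'] = pd.Series([-1.2, 9.9], index=['two', 'ten'])

# A plain list has no labels, so its length must match the number of rows
try:
    frame['bad'] = [1, 2]
    length_error = None
except ValueError as exc:
    length_error = exc

print(frame['debt'].isna().tolist())   # [True, False, True]
print(type(length_error).__name__)     # ValueError
```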


In [41]:
#Assigning a column that doesn’t exist will create a new column. 
frame2['eastern'] = frame2.eucountry == 'france'
print(frame2)

       year eucountry  popul  debt  eastern
one    2000    france    1.5   NaN     True
two    2001   germany    1.7  -1.2    False
three  2002   austria    3.6   NaN    False
four   2001    sweden    2.4  -1.5    False
five   2002    Norway    2.9  -1.7    False


In [42]:
frame2['abvthold'] = frame2.popul > 1.6

In [43]:
frame2

       year eucountry  popul  debt  eastern  abvthold
one    2000    france    1.5   NaN     True     False
two    2001   germany    1.7  -1.2    False      True
three  2002   austria    3.6   NaN    False      True
four   2001    sweden    2.4  -1.5    False      True
five   2002    Norway    2.9  -1.7    False      True


In [44]:
#The del keyword will delete columns as with a dict:
del frame2['eastern']
del frame2['abvthold']

In [45]:
frame2.columns

Index(['year', 'eucountry', 'popul', 'debt'], dtype='object')
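
Where the original DataFrame should stay intact, the `drop` method is an alternative to `del`: it returns a new DataFrame without the named columns. A minimal sketch on illustrative data:

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001], 'popul': [1.5, 1.7], 'debt': [0.0, 1.0]})

# drop returns a new DataFrame; the original keeps all of its columns
trimmed = frame.drop(columns=['debt'])

print(list(trimmed.columns))   # ['year', 'popul']
print('debt' in frame.columns) # True
```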

The column returned when indexing a DataFrame can be a view on the underlying data rather than a copy, so in-place modifications to the Series may be reflected in the DataFrame; the exact behavior depends on the pandas version and on whether Copy-on-Write is enabled. The column can be explicitly copied using the Series’s copy method.
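
A minimal sketch of the explicit copy, using illustrative data: changes made to the copied Series do not affect the DataFrame it came from.

```python
import pandas as pd

frame = pd.DataFrame({'popul': [1.5, 1.7, 3.6]})

# copy() gives an independent Series
popul_copy = frame['popul'].copy()
popul_copy.iloc[0] = 99.0

print(popul_copy.iloc[0])       # 99.0
print(frame.loc[0, 'popul'])    # 1.5
```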


In [46]:
#Another common form of data is a nested dict of dicts format:
pop = {'norway': {2001: 2.4, 2002: 2.9},'denmark': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
pop

{'norway': {2001: 2.4, 2002: 2.9},
 'denmark': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [47]:
#If passed to DataFrame, it will interpret the outer dict keys as the columns and the inner
#keys as the row indices:
frame3 = pd.DataFrame(pop)

In [48]:
frame3

      norway  denmark
2001     2.4      1.7
2002     2.9      3.6
2000     NaN      1.5


In [49]:
#you can always transpose the result
print(frame3.T)

         2001  2002  2000
norway    2.4   2.9   NaN
denmark   1.7   3.6   1.5


In [50]:
#The keys in the inner dicts are unioned to form the index in the result (in the output above
#they appear in order of first appearance rather than sorted). This
#isn’t true if an explicit index is specified:
pd.DataFrame(pop, index=[2001, 2002, 2003])

      norway  denmark
2001     2.4      1.7
2002     2.9      3.6
2003     NaN      NaN


In [51]:
#Dicts of Series are treated much in the same way:
pdata = {'norway': frame3['norway'][:-1],
'denmark': frame3['denmark'][:2]}
pd.DataFrame(pdata)

      norway  denmark
2001     2.4      1.7
2002     2.9      3.6
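
When the Series in the dict do not share the same labels, the resulting index is the union of their labels and the gaps are filled with NaN. A minimal sketch with two standalone Series (built here from the population figures used above):

```python
import pandas as pd

norway = pd.Series({2001: 2.4, 2002: 2.9})
denmark = pd.Series({2000: 1.5, 2001: 1.7, 2002: 3.6})

# The index of the result is the union of both Series' labels
combined = pd.DataFrame({'norway': norway, 'denmark': denmark})

print(sorted(combined.index))                  # [2000, 2001, 2002]
print(pd.isna(combined.loc[2000, 'norway']))   # True
```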


In [52]:
#If a DataFrame’s index and columns have their name attributes set, these will also be
#displayed:
frame3.index.name = 'year'
frame3.columns.name = 'eucountry'
print(frame3)

eucountry  norway  denmark
year
2001          2.4      1.7
2002          2.9      3.6
2000          NaN      1.5


In [53]:
#Like Series, the values attribute returns the data contained in the DataFrame as a 2D ndarray:
print(frame3.values)

[[2.4 1.7]
 [2.9 3.6]
 [nan 1.5]]


In [54]:
#If the DataFrame’s columns are different dtypes, the dtype of the values array will be
#chosen to accommodate all of the columns:
print(frame2.values)

[[2000 'france' 1.5 nan]
 [2001 'germany' 1.7 -1.2]
 [2002 'austria' 3.6 nan]
 [2001 'sweden' 2.4 -1.5]
 [2002 'Norway' 2.9 -1.7]]
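
The chosen dtype can be inspected directly. In the sketch below (illustrative data), integer and float columns combine into a float64 array, while a numeric column next to a string column falls back to object:

```python
import pandas as pd

numeric = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5]})
mixed = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

print(numeric.values.dtype)   # float64
print(mixed.values.dtype)     # object
```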
