# Pandas Basics

## Pandas DataFrames

Pandas is a high-level data manipulation tool developed by Wes McKinney. It is built on the NumPy package, and its key data structure is the DataFrame. A DataFrame stores tabular data as rows of observations and columns of variables.

There are several ways to create a DataFrame. One way is to use a dictionary whose keys are the column names and whose values are lists of column data. For example:

In [1]:
import pandas as pd

# Avoid naming the variable `dict`, which would shadow the built-in type
data = {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252, 1357, 52.98]}

brics = pd.DataFrame(data)
print(brics)

        country    capital    area  population
0        Brazil   Brasilia   8.516      200.40
1        Russia     Moscow  17.100      143.50
2         India  New Delhi   3.286     1252.00
3         China    Beijing   9.597     1357.00
4  South Africa   Pretoria   1.221       52.98
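
A dictionary of columns is only one possible input. As a minimal sketch of another common approach, the block below builds a DataFrame from a list of dictionaries, one per row. The names `records` and `brics_rows` are illustrative and do not appear in the original notebook; the values are a subset of the table above.

```python
import pandas as pd

# Each dictionary is one observation (row); its keys become the column names.
records = [
    {"country": "Brazil", "capital": "Brasilia", "area": 8.516},
    {"country": "Russia", "capital": "Moscow", "area": 17.10},
    {"country": "India", "capital": "New Delhi", "area": 3.286},
]

brics_rows = pd.DataFrame(records)

print(brics_rows.shape)          # (number of rows, number of columns)
print(list(brics_rows.columns))  # column names taken from the dictionary keys
```

The row-oriented form tends to be convenient when the data arrives one record at a time, while the column-oriented dictionary is more compact when whole columns are already available as lists.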


As the output shows, Pandas automatically labeled the rows of brics with the integers 0 through 4. These row labels are called the index. To use different index values, such as two-letter country codes, assign a list of labels to the index attribute:

In [2]:
# Set the index for brics
brics.index = ["BR", "RU", "IN", "CN", "ZA"]

# Print out brics with new index values
print(brics)

         country    capital    area  population
BR        Brazil   Brasilia   8.516      200.40
RU        Russia     Moscow  17.100      143.50
IN         India  New Delhi   3.286     1252.00
CN         China    Beijing   9.597     1357.00
ZA  South Africa   Pretoria   1.221       52.98
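
Assigning to the index attribute changes an existing DataFrame in place. As a hedged sketch of two alternatives, the block below sets the labels when the DataFrame is built and, separately, promotes an existing column to the index with `set_index`. The `code` column and the names `by_arg` and `by_column` are hypothetical and used only for illustration.

```python
import pandas as pd

data = {
    "country": ["Brazil", "Russia", "India"],
    "code": ["BR", "RU", "IN"],  # hypothetical helper column, not in the original data
    "population": [200.4, 143.5, 1252],
}

# Option 1: pass the row labels when the DataFrame is built.
by_arg = pd.DataFrame(
    {"country": data["country"], "population": data["population"]},
    index=data["code"],
)

# Option 2: promote an existing column to the index.
# set_index returns a new DataFrame and removes "code" from the columns.
by_column = pd.DataFrame(data).set_index("code")

print(by_arg.loc["IN", "population"])
print(by_column.index.tolist())
```

Either way, the resulting labels can be used for label-based lookups in the same way as labels assigned through the index attribute.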


Another way to create a DataFrame is to import a CSV file. Assuming a file named cars.csv is stored in the working directory, it can be loaded with pd.read_csv:

In [8]:
# Import pandas as pd
import pandas as pd

# Import the cars.csv data: cars
cars = pd.read_csv('cars.csv')  # cars.csv was created by hand for this example

# Print out cars
print(cars)

     country  cars_per_cap  drives_right
0         US          0.80          True
1    Germany          0.60         False
2      Japan          0.50         False
3      India          0.05         False
4     Brazil          0.25         False
5  Australia          0.75         False
6     Mexico          0.30          True
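
If cars.csv is not available, the same call can be tried on CSV text held in memory. This is a minimal sketch, assuming the file has a header row followed by comma-separated values; `csv_text` and `cars_sample` are illustrative names, and the rows are a subset of the table above.

```python
import io

import pandas as pd

# In-memory stand-in for a file on disk.
csv_text = """country,cars_per_cap,drives_right
US,0.80,True
Germany,0.60,False
Japan,0.50,False
"""

# read_csv accepts a file path or any file-like object.
cars_sample = pd.read_csv(io.StringIO(csv_text))

print(cars_sample.dtypes)   # column types inferred by read_csv
print(cars_sample.head(2))  # first two rows
```

Checking `dtypes` after loading is a quick way to confirm that numeric and boolean columns were parsed as intended rather than read as text.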


## Indexing DataFrames

There are several ways to index a Pandas DataFrame. One of the easiest ways to do this is by using square bracket notation.

Square brackets select columns of the cars DataFrame. Single brackets, as in cars['cars_per_cap'], return a Pandas Series, while double brackets, as in cars[['cars_per_cap']], return a Pandas DataFrame. The examples below load cars.csv with index_col=0, so the country column becomes the row index and is no longer a regular column.

In [25]:
# Import pandas and cars.csv
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)  # use the first column (country) as the index

# Print out cars_per_cap column as Pandas Series
print(cars['cars_per_cap'])

country
US           0.80
Germany      0.60
Japan        0.50
India        0.05
Brazil       0.25
Australia    0.75
Mexico       0.30
Name: cars_per_cap, dtype: float64


In [26]:
# Print out cars_per_cap column as Pandas DataFrame
print(cars[['cars_per_cap']])

           cars_per_cap
country                
US                 0.80
Germany            0.60
Japan              0.50
India              0.05
Brazil             0.25
Australia          0.75
Mexico             0.30


In [27]:
# Print out drives_right column as Pandas DataFrame
# (country is the index here, so cars[['country', 'drives_right']] would raise a KeyError)
print(cars[['drives_right']])

           drives_right
country                
US                 True
Germany           False
Japan             False
India             False
Brazil            False
Australia         False
Mexico             True
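
The difference between the two bracket forms can be checked directly. The sketch below rebuilds a three-row version of the cars data in code so that it runs without cars.csv; `cars_demo`, `as_series`, `as_frame`, and `both` are illustrative names.

```python
import pandas as pd

# Small stand-in for the cars data, with country as the index.
cars_demo = pd.DataFrame(
    {"cars_per_cap": [0.80, 0.60, 0.50], "drives_right": [True, False, False]},
    index=pd.Index(["US", "Germany", "Japan"], name="country"),
)

as_series = cars_demo["cars_per_cap"]   # single brackets: one-dimensional Series
as_frame = cars_demo[["cars_per_cap"]]  # double brackets: two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)

# The inner list may hold several names, which selects several columns at once
# in the order given.
both = cars_demo[["drives_right", "cars_per_cap"]]
print(list(both.columns))
```

The double-bracket form is simply single-bracket indexing with a list as the key, which is why it always returns a DataFrame, even for one column.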


Square brackets also select observations (rows) when given a slice of integer positions. As with Python lists, the start position is included and the stop position is excluded. For example:

In [28]:
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Print out first 4 observations
print(cars[0:4])

# Print out fifth and sixth observations
print(cars[4:6])

         cars_per_cap  drives_right
country                            
US               0.80          True
Germany          0.60         False
Japan            0.50         False
India            0.05         False
           cars_per_cap  drives_right
country                              
Brazil             0.25         False
Australia          0.75         False
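
Row slicing with square brackets behaves like positional slicing with iloc. The sketch below, which uses an illustrative four-row `cars_demo` in place of cars.csv, compares the two and also shows that a single label inside square brackets is treated as a column name, not a row label.

```python
import pandas as pd

# Small stand-in for the cars data, with country as the index.
cars_demo = pd.DataFrame(
    {"cars_per_cap": [0.80, 0.60, 0.50, 0.05]},
    index=pd.Index(["US", "Germany", "Japan", "India"], name="country"),
)

first_two = cars_demo[0:2]       # positions 0 and 1; the stop position is excluded
same_rows = cars_demo.iloc[0:2]  # explicit positional equivalent

print(first_two.equals(same_rows))

# A single label inside square brackets is looked up among the columns.
try:
    cars_demo["US"]
except KeyError:
    print("'US' is a row label, so cars_demo['US'] raises KeyError")
```

Because the meaning of square brackets depends on what is passed in, the explicit loc and iloc accessors are often easier to read when selecting rows.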


You can also use loc and iloc to perform just about any data selection operation. loc is label-based: you specify rows and columns by their labels. iloc is integer-position-based: you specify rows and columns by their integer positions, as in the row-slicing example above. In both cases the row selection comes first and the optional column selection second.

In [48]:
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Using iloc to access rows and columns by integer position
print("\nUsing iloc:")
print(cars.iloc[1])  # Accessing a specific row by integer position
print()
print(cars.iloc[2, 1])  # Accessing a single element at row position 2, column position 1
print()
print(cars.iloc[:, 0])  # Accessing all rows of a specific column by integer position
print()
print(cars.iloc[2:4, 1:])  # Accessing a range of rows (positions 2 and 3) and columns (position 1 onward)
print()

# Using loc to print out observations for US and Brazil by row label
print(cars.loc[['US', 'Brazil']])


Using iloc:
cars_per_cap      0.6
drives_right    False
Name: Germany, dtype: object

False

country
US           0.80
Germany      0.60
Japan        0.50
India        0.05
Brazil       0.25
Australia    0.75
Mexico       0.30
Name: cars_per_cap, dtype: float64

         drives_right
country              
Japan           False
India           False

         cars_per_cap  drives_right
country                            
US               0.80          True
Brazil           0.25         False
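
One practical difference between the two accessors is how slices are bounded: label slices with loc include the stop label, while positional slices with iloc exclude the stop position. The sketch below illustrates this on an illustrative four-row `cars_demo` built in code; `by_label` and `by_position` are names introduced only for this example.

```python
import pandas as pd

# Small stand-in for the cars data, with country as the index.
cars_demo = pd.DataFrame(
    {
        "cars_per_cap": [0.80, 0.60, 0.50, 0.05],
        "drives_right": [True, False, False, False],
    },
    index=pd.Index(["US", "Germany", "Japan", "India"], name="country"),
)

# One row label plus one column label returns a single value.
print(cars_demo.loc["Japan", "cars_per_cap"])

# Label slices include the stop label ...
by_label = cars_demo.loc["Germany":"India", "cars_per_cap"]
# ... while positional slices exclude the stop position.
by_position = cars_demo.iloc[1:3, 0]

print(by_label.index.tolist())
print(by_position.index.tolist())
```

Both slices start at Germany, but only the loc version reaches India, so the two accessors are not interchangeable even when the labels happen to line up with positions.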
