# Introduction to Pandas

Pandas is a Python package for data manipulation and analysis. The name is derived from the econometrics term "panel data". Pandas adds two data structures to Python: the Pandas Series and the Pandas DataFrame. These data structures let us work with labeled and relational data in an easy and intuitive manner. Pandas also provides many functions for cleaning, transforming, and summarizing data.

# Why Use Pandas?

1) Allows the use of labels for rows and columns

2) Can calculate rolling statistics on time series data

3) Easy handling of NaN values

4) Can load data of different formats into DataFrames

5) Can join and merge different datasets together

6) Integrates with NumPy and Matplotlib

For these and other reasons, Pandas DataFrames have become one of the most commonly used Pandas objects for data analysis in Python.
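
A few of the features listed above can be sketched in a handful of lines. This is a minimal illustration rather than a complete tour; the temperature readings, the `prices` and `stock` tables, and all variable names are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures with one missing reading
temps = pd.Series(
    [20.0, 22.0, np.nan, 24.0, 26.0],
    index=pd.date_range('2024-01-01', periods=5, freq='D'),
)

# Easy handling of NaN values: count them, then fill them with the mean
n_missing = temps.isna().sum()
filled = temps.fillna(temps.mean())

# Rolling statistics on time series data: 2-day rolling mean
rolling_mean = filled.rolling(window=2).mean()

# Joining datasets: merge two small DataFrames on a shared column
prices = pd.DataFrame({'item': ['eggs', 'milk'], 'price': [2.5, 1.2]})
stock = pd.DataFrame({'item': ['eggs', 'milk'], 'quantity': [10, 1]})
merged = prices.merge(stock, on='item')

print('Missing readings:', n_missing)
print(rolling_mean)
print(merged)
```

The first value of the rolling mean is NaN because a 2-day window has no complete pair of readings on the first day.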

https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html#getting-started

# Pandas Series

A Pandas Series is a one-dimensional array-like object that can hold many data types, such as numbers or strings. One of the main differences between Pandas Series and NumPy ndarrays is that you can assign an index label to each element in a Pandas Series. In other words, you can name the indices of your Pandas Series anything you want. Another big difference between Pandas Series and NumPy ndarrays is that a single Pandas Series can hold data of different data types.
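
The two differences described above can be seen side by side in a short sketch. The variable names (`arr`, `s`, `mixed`) are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array is addressed by integer position
arr = np.array([10, 5, 3])
print(arr[0])

# A Series carries an index label for each element
s = pd.Series([10, 5, 3], index=['eggs', 'oranges', 'apples'])
print(s['eggs'])

# A Series can mix numbers and strings; pandas stores such values as dtype object
mixed = pd.Series([10, 'Yes'], index=['eggs', 'milk'])
print(mixed.dtype)
```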


# Exercise 1: Create a Pandas Series that Stores a Grocery List

1) Create a grocery list with the given index labels using a Pandas Series.

    a) Eggs - 10

    b) Oranges - 5

    c) Milk - Yes

    d) Butter - No

2) Find the size, shape, and dimension using .size, .shape, .ndim attributes.

3) Extract the index and data using .index and .values attributes.

4) Check whether a label is in the index using the 'in' operator.

5) Access and modify elements of the series using index labels, for example groceries['eggs'].

6) Access and modify elements of the series using the .loc and .iloc indexers.

7) Drop elements of the series using .drop method. Use inplace=True to reflect the changes in the original series.


In [4]:
import pandas as pd

groceries = pd.Series(index=['eggs','oranges','milk','butter'],data=[10,5,'Yes','No'])

print(groceries.head())


eggs        10
oranges      5
milk       Yes
butter      No
dtype: object


In [5]:
print('Groceries has shape:', groceries.shape)
print('Groceries has dimension:', groceries.ndim)
print('Groceries has a total of', groceries.size, 'elements')

Groceries has shape: (4,)
Groceries has dimension: 1
Groceries has a total of 4 elements


In [6]:
# We print the index and data of Groceries
print('The data in Groceries is:', groceries.values)
print('The index of Groceries is:', groceries.index)

The data in Groceries is: [10 5 'Yes' 'No']
The index of Groceries is: Index(['eggs', 'oranges', 'milk', 'butter'], dtype='object')


In [10]:
# We check whether bananas is a food item (an index label) in Groceries
x = 'bananas' in groceries

# We check whether eggs is a food item (an index label) in Groceries
y = 'eggs' in groceries

# We print the results
print('Is bananas an index label in Groceries:', x)
print('Is eggs an index label in Groceries:', y)

Is bananas an index label in Groceries: False
Is eggs an index label in Groceries: True


In [22]:
# We use a single index label
print('How many eggs do we need to buy:', groceries['eggs'])
print()

# We can access multiple index labels
print('Do we need milk and butter:\n', groceries[['milk', 'butter']])
print()

# We use loc to access multiple index labels
print('How many eggs and oranges do we need to buy:\n', groceries.loc[['eggs', 'oranges']])
print()

# We access elements in Groceries using numerical indices:

# We use iloc with multiple numerical indices
print('How many eggs and oranges do we need to buy:\n', groceries.iloc[[0, 1]])
print()

# We use a negative numerical index
print('Do we need butter:\n', groceries.iloc[[-1]])
print()

# We use a single numerical index
print('How many eggs do we need to buy:', groceries.iloc[0])
print()

# We use iloc to access multiple numerical indices
print('Do we need milk and butter:\n', groceries.iloc[[2, 3]])

How many eggs do we need to buy: 10

Do we need milk and butter:
 milk      Yes
butter     No
dtype: object

How many eggs and oranges do we need to buy:
 eggs       10
oranges     5
dtype: object

How many eggs and oranges do we need to buy:
 eggs       10
oranges     5
dtype: object

Do we need butter:
 butter    No
dtype: object

How many eggs do we need to buy: 10

Do we need milk and butter:
 milk      Yes
butter     No
dtype: object
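
If a list of labels may include one that is not in the index, recent pandas versions raise a KeyError from .loc and []. The sketch below shows two ways to handle that case, using .reindex() or keeping only the labels that exist. It rebuilds the same grocery list so that it runs on its own; the 'bread' label and the names `wanted`, `labels`, and `present` are only for illustration.

```python
import pandas as pd

groceries = pd.Series(index=['eggs', 'oranges', 'milk', 'butter'],
                      data=[10, 5, 'Yes', 'No'])

# 'bread' is not in the index; reindex fills it with NaN instead of raising
wanted = groceries.reindex(['milk', 'bread'])
print(wanted)

# Alternatively, keep only the requested labels that actually exist
labels = ['milk', 'bread']
present = groceries.loc[groceries.index.intersection(labels)]
print(present)
```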


In [23]:
# We display the original grocery list
print('Original Grocery List:\n', groceries)

# We change the number of eggs to 2
groceries['eggs'] = 2

# We display the changed grocery list
print()
print('Modified Grocery List:\n', groceries)

Original Grocery List:
 eggs        10
oranges      5
milk       Yes
butter      No
dtype: object

Modified Grocery List:
 eggs         2
oranges      5
milk       Yes
butter      No
dtype: object


In [28]:
# We display the original grocery list
print('Original Grocery List:\n', groceries)

# We remove oranges from our grocery list. The drop function removes elements out of place
print()
print('We remove oranges (out of place):\n', groceries.drop('oranges'))

# When we remove elements out of place the original Series remains intact. To see this
# we display our grocery list again
print()
print('Grocery List after removing oranges out of place:\n', groceries)

# We remove oranges from our grocery list in place by setting the inplace keyword to True
groceries.drop('oranges', inplace = True)

# When we remove elements in place the original Series is modified. To see this
# we display our grocery list again
print()
print('Grocery List after removing oranges in place:\n', groceries)

Original Grocery List:
 eggs         2
oranges      5
milk       Yes
butter      No
dtype: object

We remove oranges (out of place):
 eggs        2
milk      Yes
butter     No
dtype: object

Grocery List after removing oranges out of place:
 eggs         2
oranges      5
milk       Yes
butter      No
dtype: object

Grocery List after removing oranges in place:
 eggs        2
milk      Yes
butter     No
dtype: object


# Exercise 2: Operations on Pandas Series

1) Basic element-wise arithmetic operations: Multiplication, Division, Addition, and Subtraction.
    
2) Element-wise application of mathematical functions: np.exp, np.sqrt, np.power
    
3) Selective operations on indexes.

4) Boolean operations

In [29]:
fruits = pd.Series(data=[10, 6, 3], index=['apples', 'oranges', 'bananas'])

# We display the fruits Pandas Series
fruits

apples     10
oranges     6
bananas     3
dtype: int64

In [30]:
# We print fruits for reference
print('Original grocery list of fruits:\n ', fruits)

# We perform basic element-wise operations using arithmetic symbols
print()
print('fruits + 2:\n', fruits + 2) # We add 2 to each item in fruits
print()
print('fruits - 2:\n', fruits - 2) # We subtract 2 from each item in fruits
print()
print('fruits * 2:\n', fruits * 2) # We multiply each item in fruits by 2 
print()
print('fruits / 2:\n', fruits / 2) # We divide each item in fruits by 2
print()

Original grocery list of fruits:
  apples     10
oranges     6
bananas     3
dtype: int64

fruits + 2:
 apples     12
oranges     8
bananas     5
dtype: int64

fruits - 2:
 apples     8
oranges    4
bananas    1
dtype: int64

fruits * 2:
 apples     20
oranges    12
bananas     6
dtype: int64

fruits / 2:
 apples     5.0
oranges    3.0
bananas    1.5
dtype: float64
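
Arithmetic between two Series is also element-wise, but pandas matches elements by index label rather than by position. The following is a small sketch of that behavior; the second Series `extra` and its values are hypothetical.

```python
import pandas as pd

fruits = pd.Series(data=[10, 6, 3], index=['apples', 'oranges', 'bananas'])

# Hypothetical second list, in a different order and with one different label
extra = pd.Series(data=[1, 2, 4], index=['bananas', 'apples', 'pears'])

# Labels present in only one Series produce NaN
total = fruits + extra
print(total)

# fill_value treats a missing label as 0 instead of producing NaN
total_filled = fruits.add(extra, fill_value=0)
print(total_filled)
```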



In [32]:
import numpy as np

# We apply different mathematical functions to all elements of fruits
print()
print('EXP(X) = \n', np.exp(fruits))
print() 
print('SQRT(X) =\n', np.sqrt(fruits))
print()
print('POW(X,2) =\n',np.power(fruits,2)) # We raise all elements of fruits to the power of 2


EXP(X) = 
 apples     22026.465795
oranges      403.428793
bananas       20.085537
dtype: float64

SQRT(X) =
 apples     3.162278
oranges    2.449490
bananas    1.732051
dtype: float64

POW(X,2) =
 apples     100
oranges     36
bananas      9
dtype: int64


In [33]:
# Selective operations on indexes
# We add 2 only to the bananas
print('Amount of bananas + 2 = ', fruits['bananas'] + 2)
print()

# We subtract 2 from apples
print('Amount of apples - 2 = ', fruits.iloc[0] - 2)
print()

# We multiply apples and oranges by 2
print('We double the amount of apples and oranges:\n', fruits[['apples', 'oranges']] * 2)
print()

# We divide apples and oranges by 2
print('We halve the amount of apples and oranges:\n', fruits.loc[['apples', 'oranges']] / 2)

Amount of bananas + 2 =  5

Amount of apples - 2 =  8

We double the amount of apples and oranges:
 apples     20
oranges    12
dtype: int64

We halve the amount of apples and oranges:
 apples     5.0
oranges    3.0
dtype: float64


In [39]:
# Boolean operations


print(fruits['bananas']>2)

True
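
The comparison above tests a single element. Applied to the whole Series, the same comparison returns a Boolean Series that can be used as a mask to select elements. A minimal sketch, with `mask` and `between` as illustrative names:

```python
import pandas as pd

fruits = pd.Series(data=[10, 6, 3], index=['apples', 'oranges', 'bananas'])

# Comparing the whole Series yields a Boolean Series (a mask)
mask = fruits > 5
print(mask)

# Indexing with the mask keeps only the elements where it is True
print(fruits[mask])

# Conditions combine with & (and) and | (or); each condition needs parentheses
between = fruits[(fruits > 2) & (fruits < 10)]
print(between)
```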


# Exercise 3: Sunlight Travel Time to Each Planet

distance_from_sun = [149.6, 1433.5, 227.9, 108.2, 778.6]

planets = ['Earth','Saturn', 'Mars','Venus', 'Jupiter']

1) For the given data, create a Pandas Series with the distances as values and the planets as index labels.

2) Calculate the number of minutes it takes sunlight to reach each planet. You can do this by dividing each planet's distance from the Sun by the speed of light. Since the distances above are in units of 10^6 km, you can use c = 18 for the speed of light, since light travels about 18 x 10^6 km per minute.

3) Use Boolean indexing to select only those planets for which sunlight takes less than 40 minutes to reach them.

In [49]:
distance_from_sun = [149.6, 1433.5, 227.9, 108.2, 778.6]

planets = ['Earth','Saturn', 'Mars','Venus', 'Jupiter']

dist_sun=pd.Series(index=planets, data=distance_from_sun)

print('The pandas series is  : \n \n ', dist_sun)
print()

# Calculate the number of minutes it takes sunlight to reach each planet
time_reach = dist_sun / 18

print('The time taken in minutes for the sunlight to reach each planet: \n \n' , time_reach)
print()

# Close planets
close_planets= time_reach[time_reach<40]
print('The planets closest to sun are: \n \n', close_planets)

The pandas series is  : 
 
  Earth       149.6
Saturn     1433.5
Mars        227.9
Venus       108.2
Jupiter     778.6
dtype: float64

The time taken in minutes for the sunlight to reach each planet: 
 
 Earth       8.311111
Saturn     79.638889
Mars       12.661111
Venus       6.011111
Jupiter    43.255556
dtype: float64

The planets closest to sun are: 
 
 Earth     8.311111
Mars     12.661111
Venus     6.011111
dtype: float64
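
The value c = 18 is a rounded figure. As a check, the sketch below derives it from the speed of light in km/s and repeats the calculation with the unrounded value; the constant name `C_KM_PER_S` is an assumption made for this example.

```python
import pandas as pd

# Speed of light in km/s
C_KM_PER_S = 299_792.458

# Convert to units of 10^6 km per minute, matching the distance units
c = C_KM_PER_S * 60 / 1e6
print('c in 10^6 km per minute:', round(c, 3))

distance_from_sun = [149.6, 1433.5, 227.9, 108.2, 778.6]
planets = ['Earth', 'Saturn', 'Mars', 'Venus', 'Jupiter']
dist_sun = pd.Series(index=planets, data=distance_from_sun)

# Travel time in minutes, then the planets reached in under 40 minutes
time_reach = dist_sun / c
close_planets = time_reach[time_reach < 40]
print(close_planets)
```

With the unrounded value the same three planets are selected, so rounding c to 18 does not change the answer to part 3.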


# Pandas DataFrame

Pandas DataFrames are two-dimensional data structures with labeled rows and columns that can hold many data types. If you are familiar with Excel, you can think of Pandas DataFrames as being similar to a spreadsheet. We can create Pandas DataFrames manually or by loading data from a file. In these lessons we will start by learning how to create Pandas DataFrames manually from dictionaries and later we will see how we can load data into a DataFrame from a data file.
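
As a brief preview of loading data from a file, the sketch below reads CSV text with `pd.read_csv`. To keep it self-contained, an in-memory string stands in for a file on disk; the CSV content and the `item` column name are made up for this illustration.

```python
import io
import pandas as pd

# Stand-in for a file on disk; with a real file you would pass its path instead
csv_text = "item,Bob,Alice\nbike,245,500\npants,55,45\n"

# index_col uses the 'item' column as the row labels
df = pd.read_csv(io.StringIO(csv_text), index_col='item')
print(df)
```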

# Exercise 4: Creating a DataFrame

1) Create a dictionary first, and then load the dictionary data into a Pandas DataFrame using the pd.DataFrame() function.

2) Create a dictionary of Pandas Series without index labels, and load it into a DataFrame.

3) Create a dictionary with lists as values, and load it into a DataFrame.

4) Find the shape, dimensions, and size of the DataFrame using the .shape, .ndim and .size attributes.

5) Extract the index, column index, and values using .index, .columns and .values respectively.

6) Access the DataFrame using [], .loc, and .iloc.

In [75]:
# Create a Dictionary and load into DataFrame

items = {'Bob': pd.Series(index=['bike', 'pants', 'watch'], data=[245,55,25]),
        'Alice': pd.Series(index=['book','glasses','bike','pants'], data=[40,110,500,45]) }

type(items)

df= pd.DataFrame(items)

print(df)

           Bob  Alice
bike     245.0  500.0
book       NaN   40.0
glasses    NaN  110.0
pants     55.0   45.0
watch     25.0    NaN
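
The NaN entries appear because the row index is the union of the labels of both Series, and each person is missing some of those labels. The `columns` and `index` parameters of `pd.DataFrame` can restrict which columns and rows are loaded. A short sketch, with `bob_df`, `selected_df`, and `alice_df` as illustrative names:

```python
import pandas as pd

items = {'Bob': pd.Series(index=['bike', 'pants', 'watch'], data=[245, 55, 25]),
         'Alice': pd.Series(index=['book', 'glasses', 'bike', 'pants'], data=[40, 110, 500, 45])}

# Keep only Bob's column
bob_df = pd.DataFrame(items, columns=['Bob'])
print(bob_df)

# Keep only selected rows
selected_df = pd.DataFrame(items, index=['pants', 'book'])
print(selected_df)

# Restrict rows and columns at the same time
alice_df = pd.DataFrame(items, index=['glasses', 'bike'], columns=['Alice'])
print(alice_df)
```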


In [59]:
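# Find the shape, dimensions, and size of the data frame using the .shape, .ndim and .size attributes

# Extract the index, column index, and values using .index, .columns and .values respectively

# Note: the output shown below corresponds to the four-row DataFrame built from a
# dictionary of lists (default integer index), not the five-row frame above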

print('The frame has shape:', df.shape)
print('The frame has dimension:', df.ndim)
print('The frame has a total of:', df.size, 'elements')
print()
print('The data in the frame is:\n', df.values)
print()
print('The row index in the frame is:', df.index)
print()
print('The column index in the frame is:', df.columns)

The frame has shape: (4, 2)
The frame has dimension: 2
The frame has a total of: 8 elements

The data in the frame is:
 [[245  40]
 [ 25 110]
 [ 55 500]
 [ 45  45]]

The row index in the frame is: RangeIndex(start=0, stop=4, step=1)

The column index in the frame is: Index(['Bob', 'Alice'], dtype='object')


In [54]:
# Create a dictionary of Pandas Series without indexes and load it into a DataFrame
data = {'Bob' : pd.Series([245, 25, 55]),
        'Alice' : pd.Series([40, 110, 500, 45])}

df = pd.DataFrame(data)


print(df)

     Bob  Alice
0  245.0     40
1   25.0    110
2   55.0    500
3    NaN     45


In [2]:
import pandas as pd
# Create a DataFrame from a dictionary with lists as values
data = {'Bob' : [245, 25, 55, 45],
        'Alice' : [40, 110, 500, 45]}

df = pd.DataFrame(data)

print(df)

   Bob  Alice
0  245     40
1   25    110
2   55    500
3   45     45
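
When the dictionary values are plain lists, pandas assigns a default integer index. Row labels can be supplied through the `index` parameter of `pd.DataFrame`. In the sketch below the labels 'item 1' to 'item 4' are placeholders chosen only for illustration.

```python
import pandas as pd

data = {'Bob': [245, 25, 55, 45],
        'Alice': [40, 110, 500, 45]}

# Placeholder row labels; the list must have one label per row
df = pd.DataFrame(data, index=['item 1', 'item 2', 'item 3', 'item 4'])
print(df)
```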


In [24]:
# Access the data frame using [], .loc, and .iloc. Here [] selects the 'Bob' column.

df['Bob']



0    245
1     25
2     55
3     45
Name: Bob, dtype: int64

bike       245.0
book         NaN
glasses      NaN
pants       55.0
watch       25.0
Name: Bob, dtype: float64