# Pandas Series

In [1]:
import pandas as pd
import numpy as np

In [35]:
groceries = pd.Series(data=[30, 6, 'Yes', 'No'], index=['eggs', 'apples', 'milk', 'bread'])
groceries

eggs       30
apples      6
milk      Yes
bread      No
dtype: object

Notice that the Series index uses the labels passed through the index argument of pd.Series(). Because the data mixes numbers and strings, the dtype is object.

Like NumPy arrays, a pandas Series also has some useful built-in attributes.

Shape

In [6]:
groceries.shape

(4,)

In [7]:
groceries.ndim

1

In [8]:
groceries.size

4

In [9]:
groceries.index

Index(['eggs', 'apples', 'milk', 'bread'], dtype='object')

In [10]:
groceries.values

array([30, 6, 'Yes', 'No'], dtype=object)

In [11]:
"banana" in groceries

False

In [12]:
"apples" in groceries

True
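Note that the in operator checks the index labels, not the stored values. The following is a minimal sketch of the difference, rebuilding the same groceries Series so the snippet is self-contained; the variable names are only illustrative.

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, 'Yes', 'No'],
                      index=['eggs', 'apples', 'milk', 'bread'])

# `in` looks at the index labels, so a label is found...
label_found = 'eggs' in groceries

# ...but a stored value is not, because 30 is a value and not a label
value_found_via_in = 30 in groceries

# To test for a value, check against the underlying values instead
value_found = 30 in groceries.values
```

Here label_found and value_found are truthy, while value_found_via_in is False.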

### Accessing and Deleting Elements

There are multiple ways of accessing data in a Series.

In [13]:
#Using the index label
groceries['eggs']

30

In [16]:
#providing a list of index labels
groceries[['eggs','milk']]

eggs     30
milk    Yes
dtype: object

In [19]:
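#Using numeric positions like in NumPy arrays
#(on recent pandas versions, prefer groceries.iloc[0]; plain integer keys on a labeled Series are deprecated)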
groceries[0]

30

In [20]:
groceries[-1]

'No'

In [22]:
#Using list of numeric indexes
groceries[[0,1,2]]

eggs       30
apples      6
milk      Yes
dtype: object

To avoid ambiguity between index labels and numeric positions, pandas provides the loc and iloc attributes.
loc stands for location and selects by index label.
iloc stands for integer location and selects by numeric position.

Pandas Series are mutable, like NumPy arrays.


In [23]:
groceries.loc[['eggs', 'milk']]

eggs     30
milk    Yes
dtype: object

In [24]:
groceries.iloc[[0,1]]

eggs      30
apples     6
dtype: object
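The distinction matters most when the index labels are themselves integers. The sketch below uses a hypothetical Series named stock (not part of the examples above) whose integer labels deliberately do not match the positions.

```python
import pandas as pd

# Hypothetical Series: the labels are integers that differ from the positions
stock = pd.Series(data=[5, 7, 9], index=[2, 0, 1])

# loc interprets 0 as a label, so it returns the value stored under label 0
by_label = stock.loc[0]

# iloc interprets 0 as a position, so it returns the first element
by_position = stock.iloc[0]
```

With this data, by_label is 7 and by_position is 5, so being explicit about loc or iloc removes any doubt about which one is meant.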

To modify a value in a Series, we can access it by its index and assign it a new value.

In [25]:
groceries

eggs       30
apples      6
milk      Yes
bread      No
dtype: object

In [26]:
groceries['eggs'] = 20

In [27]:
groceries

eggs       20
apples      6
milk      Yes
bread      No
dtype: object

In [36]:
# To remove an element from a Series, use the drop() method
# Note that drop() returns a new Series without the element; it does not modify the original Series
groceries.drop('apples')


eggs      30
milk     Yes
bread     No
dtype: object

In [37]:
groceries

eggs       30
apples      6
milk      Yes
bread      No
dtype: object

In [38]:
# To drop the element in place, set the inplace parameter to True as follows
groceries.drop('apples', inplace=True)

In [39]:
groceries

eggs      30
milk     Yes
bread     No
dtype: object

## Arithmetic Operations on Pandas Series

In [40]:
#Create a new pandas series
fruits = pd.Series(data=[10, 6, 3], index = ['apples', 'oranges', 'banana'])

In [41]:
fruits

apples     10
oranges     6
banana      3
dtype: int64

In [42]:
#Add 2 to each element in fruits
fruits + 2

apples     12
oranges     8
banana      5
dtype: int64

In [43]:
#Subtract 2 from each element in fruits
fruits - 2

apples     8
oranges    4
banana     1
dtype: int64

In [44]:
#Multiply each element in fruits by 2
fruits * 2

apples     20
oranges    12
banana      6
dtype: int64

In [45]:
#Divide each element in fruits by 2
fruits / 2

apples     5.0
oranges    3.0
banana     1.5
dtype: float64

You can also apply NumPy mathematical functions to a pandas Series as follows:

In [48]:
#Take the square root of all elements of a Series
np.sqrt(fruits)

apples     3.162278
oranges    2.449490
banana     1.732051
dtype: float64

In [50]:
#Exponential (e raised to the power of each element)
np.exp(fruits)

apples     22026.465795
oranges      403.428793
banana        20.085537
dtype: float64

In [51]:
#Raise all elements of fruits to the power of 2
np.power(fruits,2)

apples     100
oranges     36
banana       9
dtype: int64

To operate on a selected item in the Series, access it by either its label or its numeric position and perform the desired operation.

In [53]:
fruits

apples     10
oranges     6
banana      3
dtype: int64

In [56]:
fruits['banana'] + 2

5

In [57]:
#Using numeric index

In [58]:
fruits.iloc[0] - 2

8

In [59]:
#double apples and oranges
fruits[['apples', 'oranges']] * 2

apples     20
oranges    12
dtype: int64

In [60]:
#Divide apples and oranges by 2
fruits[['apples', 'oranges']] / 2

apples     5.0
oranges    3.0
dtype: float64
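The expressions above only compute new values; fruits itself stays unchanged. A minimal sketch of storing a result back into the Series, assuming the same fruits data:

```python
import pandas as pd

fruits = pd.Series(data=[10, 6, 3], index=['apples', 'oranges', 'banana'])

# This only computes a value; fruits is not modified
fruits['banana'] + 2

# Assigning back stores the result under the 'banana' label
fruits['banana'] += 2

# The same works by position through iloc
fruits.iloc[0] = fruits.iloc[0] - 2
```

After these lines, banana holds 5, apples holds 8, and oranges is still 6.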

You can apply arithmetic operations to a pandas Series of mixed data types as long as the operation is defined for all the data types in the Series.

For example, the multiplication operation is defined for both numeric and string values in Python, so multiplying the groceries Series by two works and does not raise an error.

In [62]:
groceries * 2

eggs         60
milk     YesYes
bread      NoNo
dtype: object
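The opposite case is worth seeing too. Division is not defined for strings in Python, so the sketch below, which rebuilds the groceries Series after the drop, is expected to fail with a TypeError. The division_failed flag is just an illustrative name.

```python
import pandas as pd

groceries = pd.Series(data=[30, 'Yes', 'No'], index=['eggs', 'milk', 'bread'])

try:
    groceries / 2
    division_failed = False
except TypeError:
    # str / int is undefined in Python, so the whole operation fails
    division_failed = True
```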

# Pandas DataFrames

Pandas DataFrames are two-dimensional data structures with labeled rows and columns that can hold many different data types. Think of a DataFrame as a very powerful Excel spreadsheet.

You can create a pandas DataFrame manually or load data from a file.
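As a minimal sketch of the file route, the snippet below uses pd.read_csv(). To keep it self-contained, an in-memory string stands in for a CSV file on disk; the data and the names csv_text and cart_from_file are made up for illustration.

```python
import io
import pandas as pd

# In-memory text standing in for a CSV file on disk (made-up data)
csv_text = """item,Bob,Allice
bike,245,500
pants,25,45
"""

# index_col=0 uses the first column ('item') as the row labels
cart_from_file = pd.read_csv(io.StringIO(csv_text), index_col=0)
```

With a real file, the path would be passed to pd.read_csv() in place of the StringIO object.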

To manually create a DataFrame, you first create a dictionary with data and labels. The dictionary keys become the column labels, and each dictionary value supplies the data for that column, with its index providing the row labels. The values are usually stored in pandas Series.


In [64]:
# Example
# 1- Create a dictionary
import pandas as pd

items = {'Bob': pd.Series(data=[245, 25, 55], index=['bike', 'pants', 'watch']),
         'Allice': pd.Series(data=[40, 110, 500, 45], index=['book', 'glasses', 'bike', 'pants'])}

shopping_cart = pd.DataFrame(items)

In [65]:
shopping_cart

           Bob  Allice
bike     245.0   500.0
book       NaN    40.0
glasses    NaN   110.0
pants     25.0    45.0
watch     55.0     NaN


In [78]:
# You can also create a DataFrame without explicit row labels
items2 = {'Bob': pd.Series(data=[245, 25, 55]),
         'Allice': pd.Series(data=[40, 110, 500, 45])}

shopping_cart2 = pd.DataFrame(items2)


In [80]:
# Notice that without an explicit index, the rows are labeled with integer positions

print(shopping_cart)
print('-' * 15)
print(shopping_cart2)


           Bob  Allice
bike     245.0   500.0
book       NaN    40.0
glasses    NaN   110.0
pants     25.0    45.0
watch     55.0     NaN
---------------
     Bob  Allice
0  245.0      40
1   25.0     110
2   55.0     500
3    NaN      45


In [81]:
# DataFrame built in functions

shopping_cart.size

10

In [82]:
shopping_cart.ndim

2

In [83]:
shopping_cart.shape

(5, 2)

In [84]:
shopping_cart.index

Index(['bike', 'book', 'glasses', 'pants', 'watch'], dtype='object')

In [85]:
shopping_cart.values

array([[245., 500.],
       [ nan,  40.],
       [ nan, 110.],
       [ 25.,  45.],
       [ 55.,  nan]])

To create a DF from a subset of the data, use the columns argument to select specific keys of the dictionary.

In [86]:
bob_shopping_cart = pd.DataFrame(items, columns=['Bob'])

In [87]:
bob_shopping_cart

       Bob
bike   245
pants   25
watch   55


To keep only specific rows for every key in the dictionary, use the index argument as follows:

In [88]:
sel_shopping_cart = pd.DataFrame(items, index=['pants', 'book'])

In [89]:
sel_shopping_cart

        Bob  Allice
pants  25.0      45
book    NaN      40


You can combine the index and columns arguments to select specific rows and columns from a dict.

In [90]:
# Select only glasses and bike for Allice
alice_sel_shopping_cart = pd.DataFrame(items, index=['glasses', 'bike'], columns= ['Allice'])

In [91]:
alice_sel_shopping_cart

         Allice
glasses     110
bike        500


In [95]:
# Create a DataFrame from a dict of lists (arrays)
data = {'Integers': [1,2,3],
        'Floats': [4.5, 8.2, 9.6]}

df = pd.DataFrame(data)
df

   Integers  Floats
0         1     4.5
1         2     8.2
2         3     9.6


When the data doesn't have an index, pandas automatically assigns a numeric index. We can use the index keyword to add labels when creating the DF, as follows.

In [96]:
df2 = pd.DataFrame(data, index=['label 1', 'label 2', 'label 3'])
df2

         Integers  Floats
label 1         1     4.5
label 2         2     8.2
label 3         3     9.6


We can also create a DF from a list of Python dicts.

In [97]:
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35},
         {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5}]

# create the DF
store_items = pd.DataFrame(items2)

# display the DF
store_items

   bikes  glasses  pants  watches
0     20      NaN     30       35
1     15     50.0      5       10


As before, we can add labels to the data when creating the DF.


In [125]:
store_items_labeled = pd.DataFrame(items2, index=['store 1', 'store 2'])

In [126]:
store_items_labeled

         bikes  glasses  pants  watches
store 1     20      NaN     30       35
store 2     15     50.0      5       10


## Accessing Elements in Pandas

We can access elements in different ways in Pandas. We can access rows, columns or individual elements in a DF by using the row and column labels.

In [127]:
print(store_items)

   bikes  glasses  pants  watches
0     20      NaN     30       35
1     15     50.0      5       10


In [128]:
# Print the bikes in each store by using the column label 'bikes'
print('How many bikes are in each store:\n', store_items[['bikes']])

How many bikes are in each store:
    bikes
0     20
1     15


In [129]:
# Print multiple columns.
print('How many bikes and pants are in each store:\n', store_items[['bikes', 'pants']])

How many bikes and pants are in each store:
    bikes  pants
0     20     30
1     15      5


In [130]:
# To print a specific row (a store in this example), use the loc attribute as follows.
print('How many items are in store 1:\n', store_items_labeled.loc[['store 1']])

How many items are in store 1:
          bikes  glasses  pants  watches
store 1     20      NaN     30       35


In [131]:
# Print a specific item from a specific store as follows.
# With this chained df[column][row] form, the column label always comes first

print('How many bikes are in store 2:\n', store_items_labeled['bikes']['store 2'])

How many bikes are in store 2:
 15
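The same element can also be read in a single step. The sketch below rebuilds store_items_labeled so it runs on its own and uses loc and at, which both take the row label first and the column label second. The result variable names are only illustrative.

```python
import pandas as pd

items2 = [{'bikes': 20, 'pants': 30, 'watches': 35},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5}]
store_items_labeled = pd.DataFrame(items2, index=['store 1', 'store 2'])

# loc takes the row label first, then the column label
bikes_store_2 = store_items_labeled.loc['store 2', 'bikes']

# at is a scalar-only accessor with the same (row, column) order
pants_store_1 = store_items_labeled.at['store 1', 'pants']
```

Note the order is the reverse of the chained form: row first, column second.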


## Adding Rows or Columns to a DF

In [132]:
# Suppose we want to add a new shirts column to the store_items_labeled DF. We can do it as follows:
store_items_labeled['shirts'] = [15, 2]

store_items_labeled

         bikes  glasses  pants  watches  shirts
store 1     20      NaN     30       35      15
store 2     15     50.0      5       10       2


We can add columns to our DF by using arithmetic operations between other columns.

In [133]:
# Make a new column called suits by adding the pants and shirts columns

store_items_labeled['suits']= store_items_labeled['pants'] + store_items_labeled['shirts']

In [134]:
store_items_labeled

         bikes  glasses  pants  watches  shirts  suits
store 1     20      NaN     30       35      15     45
store 2     15     50.0      5       10       2      7


Suppose now that you open a new store and you want to add it to the DF. We can do this by adding a new row to the DF with the new store info.

In [135]:
new_items = [{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4}]

# Create new DF with new_items and give it a label
new_store = pd.DataFrame(new_items, index=['store 3'])

new_store

         bikes  glasses  pants  watches
store 3     20        4     30       35


In [136]:
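# Add the new_store row to the original DF. DataFrame.append() was removed in pandas 2.0,
# so use pd.concat() instead; sort=True orders the columns alphabetically.

store_items_labeled = pd.concat([store_items_labeled, new_store], sort=True)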

In [137]:
store_items_labeled

         bikes  glasses  pants  shirts  suits  watches
store 1     20      NaN     30    15.0   45.0       35
store 2     15     50.0      5     2.0    7.0       10
store 3     20      4.0     30     NaN    NaN       35


In [139]:
# Add a column using data from particular rows of another column.
# Let's add a 'new watches' column whose values match the watches column for stores 2 and 3 only.
# The [1:] slice takes the values from row 1 onward, in this case stores 2 and 3.
# Store 1 has no matching value, so it gets NaN.

store_items_labeled['new watches'] = store_items_labeled['watches'][1:]

store_items_labeled

         bikes  glasses  pants  shirts  suits  watches  new watches
store 1     20      NaN     30    15.0   45.0       35          NaN
store 2     15     50.0      5     2.0    7.0       10         10.0
store 3     20      4.0     30     NaN    NaN       35         35.0
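Store 1 ends up with NaN because pandas aligns an assigned Series on its index labels, not on its position. A small sketch of that alignment, using a hypothetical restock Series whose labels are out of order and incomplete:

```python
import pandas as pd

stores = pd.DataFrame({'watches': [35, 10, 35]},
                      index=['store 1', 'store 2', 'store 3'])

# Labels are deliberately out of order and 'store 1' is missing
restock = pd.Series(data=[7, 4], index=['store 3', 'store 2'])

# Each value lands on the row with the matching label; unmatched rows get NaN
stores['restock'] = restock
```

Here store 2 receives 4, store 3 receives 7, and store 1 is left as NaN.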


To insert a column anywhere we want, we can use the insert(loc, column, value) method.

In [140]:
# Let's add a new column named shoes right before the suits column. Note that the numeric position of suits is 4,
# so we can use 4 as the loc argument.

store_items_labeled.insert(4, 'shoes', [8,5,0])

In [141]:
store_items_labeled

         bikes  glasses  pants  shirts  shoes  suits  watches  new watches
store 1     20      NaN     30    15.0      8   45.0       35          NaN
store 2     15     50.0      5     2.0      5    7.0       10         10.0
store 3     20      4.0     30     NaN      0    NaN       35         35.0
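Rather than counting columns by hand, the position can be looked up by label. This is a minimal sketch on a smaller, hypothetical inventory DataFrame, using columns.get_loc() to find where suits sits before inserting.

```python
import pandas as pd

inventory = pd.DataFrame({'bikes': [20, 15], 'shirts': [15, 2], 'suits': [45, 7]},
                         index=['store 1', 'store 2'])

# Look up the position of 'suits' instead of hard-coding it
suits_position = inventory.columns.get_loc('suits')

# Insert 'shoes' at that position, which places it right before 'suits'
inventory.insert(suits_position, 'shoes', [8, 5])
```

The columns end up in the order bikes, shirts, shoes, suits.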


## Deleting rows, columns and elements from a DF

In [142]:
# pop() allows us to delete columns:
store_items_labeled.pop('new watches')

store 1     NaN
store 2    10.0
store 3    35.0
Name: new watches, dtype: float64

In [143]:
store_items_labeled

         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     NaN      0    NaN       35


In [146]:
# drop() can be used to delete both rows and columns using the axis argument
# drop the watches and shoes columns using the drop()

store_items_labeled = store_items_labeled.drop(['watches', 'shoes'], axis = 1)

store_items_labeled

         bikes  glasses  pants  shirts  suits
store 1     20      NaN     30    15.0   45.0
store 2     15     50.0      5     2.0    7.0
store 3     20      4.0     30     NaN    NaN


In [147]:
# To remove the store 1 and store 2 rows using drop(), do the following:

store_items_labeled = store_items_labeled.drop(['store 1', 'store 2'], axis = 0)

In [148]:
store_items_labeled

         bikes  glasses  pants  shirts  suits
store 3     20      4.0     30     NaN    NaN
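As an alternative to the axis argument, drop() also accepts columns and index keywords, which some readers find easier to follow. A sketch on a small hypothetical inventory DataFrame:

```python
import pandas as pd

inventory = pd.DataFrame({'bikes': [20, 15, 20],
                          'watches': [35, 10, 35],
                          'shoes': [8, 5, 0]},
                         index=['store 1', 'store 2', 'store 3'])

# columns= is equivalent to passing the labels with axis=1
without_cols = inventory.drop(columns=['watches', 'shoes'])

# index= is equivalent to passing the labels with axis=0
without_rows = inventory.drop(index=['store 1', 'store 2'])
```

Both calls return new DataFrames and leave inventory untouched.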


Sometimes, you may want to rename a column label on a DF. To do this, you can use the rename() method.

In [149]:
# rename bikes with hats

store_items_labeled = store_items_labeled.rename(columns = {'bikes': 'hats'})

store_items_labeled

         hats  glasses  pants  shirts  suits
store 3    20      4.0     30     NaN    NaN


In [150]:
# You can also use rename() to rename rows.

store_items_labeled = store_items_labeled.rename(index = {'store 3' : 'last store'})

store_items_labeled

            hats  glasses  pants  shirts  suits
last store    20      4.0     30     NaN    NaN


In [None]:
# You can also change the index to be one of the columns in your DF.

store_items_labeled = store_items_labeled.set_index('pants')


In [155]:
store_items_labeled

       hats  glasses  shirts  suits
pants
30       20      4.0     NaN    NaN
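Notice that the old 'last store' label is gone: by default set_index() replaces the existing index. The sketch below, on a reduced hypothetical DataFrame, shows this and how reset_index() moves the index back into a regular column.

```python
import pandas as pd

df = pd.DataFrame({'hats': [20], 'pants': [30]}, index=['last store'])

# 'pants' becomes the index; the previous labels are discarded
indexed = df.set_index('pants')

# reset_index() turns the index back into a column and restores 0..n-1 labels
restored = indexed.reset_index()
```

After reset_index(), pants is a column again, but the original 'last store' label is not recovered.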


# Dealing with NaN


To get more accurate models, data scientists deal with NaN values before feeding the data to a learning algorithm. Pandas offers different ways to deal with NaN values. Here are some of the functions used in pandas to clean up data.

In [8]:
#Create a list of dictionaries
items3 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
         {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
         {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10}]

# create the data frames

stores_items = pd.DataFrame(items3, index = ['store 1', 'store 2', 'store 3'])

stores_items


         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     NaN     10    NaN       35


In [11]:
# We can count the number of NaN values in each column of the DataFrame with isnull().sum()

x = stores_items.isnull().sum()

print(x)

bikes      0
glasses    1
pants      0
shirts     1
shoes      0
suits      1
watches    0
dtype: int64


In [12]:
# To count the total number of NaN values in the DataFrame, we can use isnull().sum().sum()

y = stores_items.isnull().sum().sum()

print(y)

3


In [14]:
# Instead of counting the NaN values per column, we can use count() to get the number of non-NaN values per column

z = stores_items.count()

print(z)

bikes      3
glasses    2
pants      3
shirts     2
shoes      3
suits      2
watches    3
dtype: int64


### Eliminating rows and columns with NaN values

In [16]:
# Use dropna(axis) to drop the rows or columns that contain NaN values.
# With axis = 0, the rows that contain NaN values are dropped

stores_items.dropna(axis = 0)

         bikes  glasses  pants  shirts  shoes  suits  watches
store 2     15     50.0      5     2.0      5    7.0       10


In [17]:
# drop the columns that contain NaN values

stores_items.dropna(axis = 1)

         bikes  pants  shoes  watches
store 1     20     30      8       35
store 2     15      5      5       10
store 3     20     30     10       35
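Dropping every row or column that has any NaN can discard a lot of data. dropna() also has subset and thresh parameters for finer control. The sketch below uses a reduced, hypothetical version of the stores data.

```python
import numpy as np
import pandas as pd

stores = pd.DataFrame({'bikes': [20, 15, 20],
                       'glasses': [np.nan, 50, 4],
                       'shirts': [15, 2, np.nan]},
                      index=['store 1', 'store 2', 'store 3'])

# Drop a row only when 'shirts' is missing; NaN in other columns is tolerated
has_shirts = stores.dropna(subset=['shirts'])

# Keep only the rows that have at least 3 non-NaN values
mostly_complete = stores.dropna(thresh=3)
```

With this data, has_shirts keeps stores 1 and 2, while mostly_complete keeps only store 2.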


In [19]:
# Often, instead of dropping rows or columns, we want to fill in the NaN values with other values.
# One option is forward filling.

# Replace NaN values with the previous value in the same column
# (ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions)
print(stores_items)
stores_items.ffill(axis=0)

         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     NaN     10    NaN       35


         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     2.0     10    7.0       35


In [21]:
# To replace NaN values with the previous value in the same row, use the following:

print(stores_items)
stores_items.ffill(axis=1)

         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     NaN     10    NaN       35


         bikes  glasses  pants  shirts  shoes  suits  watches
store 1   20.0     20.0   30.0    15.0    8.0   45.0     35.0
store 2   15.0     50.0    5.0     2.0    5.0    7.0     10.0
store 3   20.0      4.0   30.0    30.0   10.0   10.0     35.0


In [23]:
# We can also do backward filling, which replaces each NaN with the next value, by using bfill() instead.

# Along columns (only the last expression in a cell is displayed, so this result is not shown)
print(stores_items)
stores_items.bfill(axis=0)

# Along rows
stores_items.bfill(axis=1)

         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     NaN     10    NaN       35


         bikes  glasses  pants  shirts  shoes  suits  watches
store 1   20.0     30.0   30.0    15.0    8.0   45.0     35.0
store 2   15.0     50.0    5.0     2.0    5.0    7.0     10.0
store 3   20.0      4.0   30.0    10.0   10.0   35.0     35.0
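Forward and backward filling borrow a neighbouring value, which may not always make sense for the data. Two other common options are filling with a constant and filling with each column's mean. The following is a minimal sketch on a reduced, hypothetical version of the stores data.

```python
import numpy as np
import pandas as pd

stores = pd.DataFrame({'bikes': [20, 15, 20],
                       'glasses': [np.nan, 50, 4],
                       'shirts': [15, 2, np.nan]},
                      index=['store 1', 'store 2', 'store 3'])

# Replace every NaN with the constant 0
filled_zero = stores.fillna(0)

# Replace each NaN with the mean of its own column
# (stores.mean() skips NaN values and returns one mean per column)
filled_mean = stores.fillna(stores.mean())
```

Like the other fill methods shown here, both calls return a new DataFrame and leave stores unchanged.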
