Pandas:
    
    Allows the use of labels for rows and columns
    Calculates rolling statistics on time series data
    Handles NaN (missing) values easily
    Loads data of different formats into DataFrames
    Joins and merges different datasets
    Integrates with NumPy and Matplotlib


In [1]:
import pandas as pd

Series

In [2]:
# We import Pandas as pd into Python
import pandas as pd

# We create a Pandas Series that stores a grocery list
groceries = pd.Series(data = [30, 6, 'Yes', 'No'], index = ['eggs', 'apples', 'milk', 'bread'])

# We display the Groceries Pandas Series
groceries

eggs       30
apples      6
milk      Yes
bread      No
dtype: object

Some Series Attributes

In [3]:
# We print some information about Groceries
print('Groceries has shape:', groceries.shape)
print('Groceries has dimension:', groceries.ndim)
print('Groceries has a total of', groceries.size, 'elements')

Groceries has shape: (4,)
Groceries has dimension: 1
Groceries has a total of 4 elements


Accessing the values and the index of a Series

In [4]:
# We print the index and data of Groceries
print('The data in Groceries is:', groceries.values)
print('The index of Groceries is:', groceries.index)

The data in Groceries is: [30 6 'Yes' 'No']
The index of Groceries is: Index(['eggs', 'apples', 'milk', 'bread'], dtype='object')


We can use 'in' to check whether a Series contains a given index label

In [6]:
# We check whether bananas is a food item (an index) in Groceries
x = 'bananas' in groceries

# We check whether bread is a food item (an index) in Groceries
y = 'bread' in groceries

# We print the results
print('Is bananas an index label in Groceries:', x)
print('Is bread an index label in Groceries:', y)

Is bananas an index label in Groceries: False
Is bread an index label in Groceries: True
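
Because 'in' looks only at the index labels, a value stored in the Series is not found this way. A minimal sketch of how to test for values instead, using a small hypothetical Series named stock that is not part of the notebook above:

```python
import pandas as pd

# Hypothetical example data, chosen only for illustration
stock = pd.Series(data=[30, 6], index=['eggs', 'apples'])

label_found = 'eggs' in stock               # checks the index labels
value_as_label = 30 in stock                # 30 is a value, not a label
value_found = 30 in stock.values            # checks the underlying data
value_found_isin = stock.isin([30]).any()   # same check using Series.isin

print(label_found, value_as_label, value_found, value_found_isin)
```

Checking stock.values (or using isin) is the usual way to ask whether a value, rather than a label, is present.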


Various ways to index a Pandas Series

In [7]:
# We access elements in Groceries using index labels:

# We use a single index label
print('How many eggs do we need to buy:', groceries['eggs'])
print()

# we can access multiple index labels
print('Do we need milk and bread:\n', groceries[['milk', 'bread']]) 
print()

# we use loc to access multiple index labels
print('How many eggs and apples do we need to buy:\n', groceries.loc[['eggs', 'apples']]) 
print()

# We access elements in Groceries using numerical (positional) indices.
# Plain [] with integers on a label-indexed Series is deprecated in recent
# Pandas versions, so we use iloc for positional access throughout.

# we use multiple numerical indices
print('How many eggs and apples do we need to buy:\n',  groceries.iloc[[0, 1]]) 
print()

# We use a negative numerical index
print('Do we need bread:\n', groceries.iloc[[-1]]) 
print()

# We use a single numerical index
print('How many eggs do we need to buy:', groceries.iloc[0]) 
print()

# we use iloc to access multiple numerical indices
print('Do we need milk and bread:\n', groceries.iloc[[2, 3]])

How many eggs do we need to buy: 30

Do we need milk and bread:
 milk     Yes
bread     No
dtype: object

How many eggs and apples do we need to buy:
 eggs      30
apples     6
dtype: object

How many eggs and apples do we need to buy:
 eggs      30
apples     6
dtype: object

Do we need bread:
 bread    No
dtype: object

How many eggs do we need to buy: 30

Do we need milk and bread:
 milk     Yes
bread     No
dtype: object
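
The difference between loc (labels) and iloc (positions) matters most when the index labels are themselves integers. A minimal sketch, assuming a hypothetical Series named codes:

```python
import pandas as pd

# Hypothetical Series whose index labels are themselves integers
codes = pd.Series(data=['a', 'b', 'c'], index=[10, 20, 30])

by_label = codes.loc[10]       # the element labelled 10
by_position = codes.iloc[0]    # the element at the first position
last_item = codes.iloc[-1]     # the element at the last position

print(by_label, by_position, last_item)
```

Being explicit with loc and iloc avoids any ambiguity about whether an integer means a label or a position.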


Note that Pandas Series is mutable

In [8]:
# We display the original grocery list
print('Original Grocery List:\n', groceries)

# We change the number of eggs to 2
groceries['eggs'] = 2

# We display the changed grocery list
print()
print('Modified Grocery List:\n', groceries)

Original Grocery List:
 eggs       30
apples      6
milk      Yes
bread      No
dtype: object

Modified Grocery List:
 eggs        2
apples      6
milk      Yes
bread      No
dtype: object


Delete items in a Pandas Series

In [9]:
# We display the original grocery list
print('Original Grocery List:\n', groceries)

# We remove apples from our grocery list. By default, drop works out of place:
# it returns a new Series and leaves the original unchanged
print()
print('We remove apples (out of place):\n', groceries.drop('apples'))

# When we remove elements out of place the original Series remains intact. To see this
# we display our grocery list again
print()
print('Grocery List after removing apples out of place:\n', groceries)

Original Grocery List:
 eggs        2
apples      6
milk      Yes
bread      No
dtype: object

We remove apples (out of place):
 eggs       2
milk     Yes
bread     No
dtype: object

Grocery List after removing apples out of place:
 eggs        2
apples      6
milk      Yes
bread      No
dtype: object


Using inplace = True modifies the original Series

In [10]:
# We display the original grocery list
print('Original Grocery List:\n', groceries)

# We remove apples from our grocery list in place by setting the inplace keyword to True
groceries.drop('apples', inplace = True)

# When we remove elements in place the original Series is modified. To see this
# we display our grocery list again
print()
print('Grocery List after removing apples in place:\n', groceries)

Original Grocery List:
 eggs        2
apples      6
milk      Yes
bread      No
dtype: object

Grocery List after removing apples in place:
 eggs       2
milk     Yes
bread     No
dtype: object
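
One detail worth knowing is what drop returns in each mode. A minimal sketch, assuming a hypothetical Series named basket:

```python
import pandas as pd

# Hypothetical example data
basket = pd.Series(data=[2, 6, 1], index=['eggs', 'apples', 'milk'])

# Out of place: drop returns a new Series, basket is unchanged
smaller = basket.drop('apples')

# In place: drop modifies basket itself and returns None
result = basket.drop('milk', inplace=True)

print(list(smaller.index))  # labels left in the returned copy
print(list(basket.index))   # labels left in the original
print(result)
```

Because the in-place call returns None, writing basket = basket.drop(..., inplace = True) would overwrite the Series with None.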


Arithmetic Operations on Pandas Series

In [11]:
# We create a Pandas Series that stores a grocery list of just fruits
fruits = pd.Series(data = [10, 6, 3], index = ['apples', 'oranges', 'bananas'])

# We display the fruits Pandas Series
fruits

apples     10
oranges     6
bananas     3
dtype: int64

All these operations are broadcast to every element of the Series

In [13]:
# We print fruits for reference
print('Original grocery list of fruits:\n ', fruits)

# We perform basic element-wise operations using arithmetic symbols
print()
print('fruits + 2:\n', fruits + 2) # We add 2 to each item in fruits
print()
print('fruits - 2:\n', fruits - 2) # We subtract 2 from each item in fruits
print()
print('fruits * 2:\n', fruits * 2) # We multiply each item in fruits by 2 
print()
print('fruits / 2:\n', fruits / 2) # We divide each item in fruits by 2
print()

Original grocery list of fruits:
  apples     10
oranges     6
bananas     3
dtype: int64

fruits + 2:
 apples     12
oranges     8
bananas     5
dtype: int64

fruits - 2:
 apples     8
oranges    4
bananas    1
dtype: int64

fruits * 2:
 apples     20
oranges    12
bananas     6
dtype: int64

fruits / 2:
 apples     5.0
oranges    3.0
bananas    1.5
dtype: float64
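
The operations above combine a Series with a single number. When two Series are combined, Pandas matches elements by index label rather than by position. A minimal sketch with hypothetical data (shop_a and shop_b are made-up names):

```python
import pandas as pd

# Hypothetical stock counts at two shops
shop_a = pd.Series(data=[1, 2], index=['apples', 'oranges'])
shop_b = pd.Series(data=[10, 20], index=['oranges', 'bananas'])

# Labels present in only one Series produce NaN
total = shop_a + shop_b

# fill_value treats a missing label as 0 instead
total_filled = shop_a.add(shop_b, fill_value=0)

print(total)
print(total_filled)
```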



We can also apply NumPy functions to a Series

In [14]:
# We import NumPy as np to be able to use the mathematical functions
import numpy as np

# We print fruits for reference
print('Original grocery list of fruits:\n', fruits)

# We apply different mathematical functions to all elements of fruits
print()
print('EXP(X) = \n', np.exp(fruits))
print() 
print('SQRT(X) =\n', np.sqrt(fruits))
print()
print('POW(X,2) =\n',np.power(fruits,2)) # We raise all elements of fruits to the power of 2


Original grocery list of fruits:
 apples     10
oranges     6
bananas     3
dtype: int64

EXP(X) = 
 apples     22026.465795
oranges      403.428793
bananas       20.085537
dtype: float64

SQRT(X) =
 apples     3.162278
oranges    2.449490
bananas    1.732051
dtype: float64

POW(X,2) =
 apples     100
oranges     36
bananas      9
dtype: int64


Applying math operations to selected items in the Series

In [15]:
# We print fruits for reference
print('Original grocery list of fruits:\n ', fruits)
print()

# We add 2 only to the bananas
print('Amount of bananas + 2 = ', fruits['bananas'] + 2)
print()

# We subtract 2 from apples
print('Amount of apples - 2 = ', fruits.iloc[0] - 2)
print()

# We multiply apples and oranges by 2
print('We double the amount of apples and oranges:\n', fruits[['apples', 'oranges']] * 2)
print()

# We divide apples and oranges by 2
print('We halve the amount of apples and oranges:\n', fruits.loc[['apples', 'oranges']] / 2)

Original grocery list of fruits:
  apples     10
oranges     6
bananas     3
dtype: int64

Amount of bananas + 2 =  5

Amount of apples - 2 =  8

We double the amount of apples and oranges:
 apples     20
oranges    12
dtype: int64

We halve the amount of apples and oranges:
 apples     5.0
oranges    3.0
dtype: float64


Note that multiplication is defined for both strings and numbers, so no error is raised here

In [17]:
# We multiply our grocery list by 2
groceries * 2

eggs          4
milk     YesYes
bread      NoNo
dtype: object
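
If the string repetition is not what we want, one option is to convert the Series to numbers first. A minimal sketch, assuming a hypothetical mixed-type Series named mixed; pd.to_numeric with errors = 'coerce' turns anything it cannot parse into NaN:

```python
import pandas as pd

# Hypothetical mixed-type Series, similar in spirit to the grocery list
mixed = pd.Series(data=[2, 'Yes', 'No'], index=['eggs', 'milk', 'bread'])

# Entries that cannot be parsed as numbers become NaN
numeric = pd.to_numeric(mixed, errors='coerce')

# Arithmetic now only affects the numeric entry; NaN stays NaN
doubled = numeric * 2
print(doubled)
```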

In [21]:
#Question
import pandas as pd

# Create a Pandas Series that contains the distance of some planets from the Sun.
# Use the name of the planets as the index to your Pandas Series, and the distance
# from the Sun as your data. The distance from the Sun is in units of 10^6 km

distance_from_sun = [149.6, 1433.5, 227.9, 108.2, 778.6]

planets = ['Earth','Saturn', 'Mars','Venus', 'Jupiter']

# Create a Pandas Series using the above data, with the name of the planets as
# the index and the distance from the Sun as your data.
dist_planets = pd.Series(data = distance_from_sun, index = planets)

# Calculate the number of minutes it takes sunlight to reach each planet. You can
# do this by dividing the distance from the Sun for each planet by the speed of light.
# Since in the data above the distance from the Sun is in units of 10^6 km, you can
# use a value for the speed of light of c = 18, since light travels 18 x 10^6 km/minute.
time_light = dist_planets / 18

# Use Boolean indexing to select only those planets for which sunlight takes less
# than 40 minutes to reach them.
close_planets = time_light[time_light < 40]

In [23]:
#Answer
import pandas as pd

distance_from_sun = [149.6, 1433.5, 227.9, 108.2, 778.6]

planets = ['Earth','Saturn', 'Mars','Venus', 'Jupiter']

dist_planets = pd.Series(data = distance_from_sun, index = planets)

time_light = dist_planets / 18

close_planets = time_light[time_light < 40]
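
To see how the Boolean indexing in the last step works, it helps to look at the intermediate mask. A minimal sketch using the same data as the exercise; the name mask is introduced here only for illustration:

```python
import pandas as pd

# Same data as in the exercise above
distance_from_sun = [149.6, 1433.5, 227.9, 108.2, 778.6]
planets = ['Earth', 'Saturn', 'Mars', 'Venus', 'Jupiter']
dist_planets = pd.Series(data=distance_from_sun, index=planets)
time_light = dist_planets / 18

# The comparison yields a Series of True/False values, one per planet
mask = time_light < 40

# Indexing with the mask keeps only the entries where it is True
close_planets = time_light[mask]

print(mask)
print(close_planets.round(1))
```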

Pandas DataFrames

Create a DataFrame from a dictionary of Pandas Series

In [24]:
# We import Pandas as pd into Python
import pandas as pd

# We create a dictionary of Pandas Series 
items = {'Bob' : pd.Series(data = [245, 25, 55], index = ['bike', 'pants', 'watch']),
         'Alice' : pd.Series(data = [40, 110, 500, 45], index = ['book', 'glasses', 'bike', 'pants'])}

# We print the type of items to see that it is a dictionary
print(type(items))

<class 'dict'>


Pass the dictionary to the pd.DataFrame function

The DataFrame is displayed in tabular form. The row index is the union of the index labels of the two Series, and the column labels are the keys of the dictionary. NaN stands for 'not a number' and marks a missing value.

In [26]:
# We create a Pandas DataFrame by passing it a dictionary of Pandas Series
shopping_carts = pd.DataFrame(items)

# We display the DataFrame
shopping_carts

         Alice    Bob
bike     500.0  245.0
book      40.0    NaN
glasses  110.0    NaN
pants     45.0   25.0
watch      NaN   55.0
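
The table above also shows a side effect of the union: columns that receive a NaN are stored as floats. A minimal sketch with a smaller, hypothetical dictionary named carts:

```python
import pandas as pd

# Hypothetical carts with only partly overlapping items
carts = {'Bob': pd.Series(data=[245, 25], index=['bike', 'pants']),
         'Alice': pd.Series(data=[40], index=['book'])}
df = pd.DataFrame(carts)

print(df)
print(df.dtypes)                         # both columns hold floats
print(df.loc['book', 'Alice'])           # a value that was present
print(pd.isna(df.loc['book', 'Bob']))    # a gap filled with NaN
```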


The same data without row labels: Pandas assigns numerical row indices

In [27]:
# We create a dictionary of Pandas Series without indexes
data = {'Bob' : pd.Series([245, 25, 55]),
        'Alice' : pd.Series([40, 110, 500, 45])}

# We create a DataFrame
df = pd.DataFrame(data)

# We display the DataFrame
df

   Alice    Bob
0     40  245.0
1    110   25.0
2    500   55.0
3     45    NaN


Some attributes of a DataFrame

In [28]:
# We print some information about shopping_carts
print('shopping_carts has shape:', shopping_carts.shape)
print('shopping_carts has dimension:', shopping_carts.ndim)
print('shopping_carts has a total of:', shopping_carts.size, 'elements')
print()
print('The data in shopping_carts is:\n', shopping_carts.values)
print()
print('The row index in shopping_carts is:', shopping_carts.index)
print()
print('The column index in shopping_carts is:', shopping_carts.columns)

shopping_carts has shape: (5, 2)
shopping_carts has dimension: 2
shopping_carts has a total of: 10 elements

The data in shopping_carts is:
 [[500. 245.]
 [ 40.  nan]
 [110.  nan]
 [ 45.  25.]
 [ nan  55.]]

The row index in shopping_carts is: Index(['bike', 'book', 'glasses', 'pants', 'watch'], dtype='object')

The column index in shopping_carts is: Index(['Alice', 'Bob'], dtype='object')


DataFrame with only selected columns or rows

In [29]:
# We create a DataFrame that only has Bob's data
bob_shopping_cart = pd.DataFrame(items, columns=['Bob'])

# We display bob_shopping_cart
bob_shopping_cart

       Bob
bike   245
pants   25
watch   55


In [30]:
# We create a DataFrame that only has selected items for both Alice and Bob
sel_shopping_cart = pd.DataFrame(items, index = ['pants', 'book'])

# We display sel_shopping_cart
sel_shopping_cart

       Alice   Bob
pants     45  25.0
book      40   NaN


In [31]:
# We create a DataFrame that only has selected items for Alice
alice_sel_shopping_cart = pd.DataFrame(items, index = ['glasses', 'bike'], columns = ['Alice'])

# We display alice_sel_shopping_cart
alice_sel_shopping_cart

         Alice
glasses    110
bike       500


Create a DataFrame from a dictionary of lists. The lists must all have the same length

In [32]:
# We create a dictionary of lists (arrays)
data = {'Integers' : [1,2,3],
        'Floats' : [4.5, 8.2, 9.6]}

# We create a DataFrame 
df = pd.DataFrame(data)

# We display the DataFrame
df

   Floats  Integers
0     4.5         1
1     8.2         2
2     9.6         3


In [33]:
# We create a dictionary of lists (arrays)
data = {'Integers' : [1,2,3],
        'Floats' : [4.5, 8.2, 9.6]}

# We create a DataFrame and provide the row index
df = pd.DataFrame(data, index = ['label 1', 'label 2', 'label 3'])

# We display the DataFrame
df

         Floats  Integers
label 1     4.5         1
label 2     8.2         2
label 3     9.6         3


Create a DataFrame from a list of dictionaries

In [48]:
# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35}, 
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]

# We create a DataFrame 
store_items = pd.DataFrame(items2)

# We display the DataFrame
store_items

   bikes  glasses  pants  watches
0     20      NaN     30       35
1     15     50.0      5       10


In [49]:
# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35}, 
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]

# We create a DataFrame  and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2'])

# We display the DataFrame
store_items

         bikes  glasses  pants  watches
store 1     20      NaN     30       35
store 2     15     50.0      5       10


Different ways to access columns or rows

In [50]:
# We print the store_items DataFrame
print(store_items)

# We access rows, columns and elements using labels
print()
print('How many bikes are in each store:\n', store_items[['bikes']])
print()
print('How many bikes and pants are in each store:\n', store_items[['bikes', 'pants']])
print()
print('What items are in Store 1:\n', store_items.loc[['store 1']])
print()
print('How many bikes are in Store 2:', store_items['bikes']['store 2']) #column label first, then row label

         bikes  glasses  pants  watches
store 1     20      NaN     30       35
store 2     15     50.0      5       10

How many bikes are in each store:
          bikes
store 1     20
store 2     15

How many bikes and pants are in each store:
          bikes  pants
store 1     20     30
store 2     15      5

What items are in Store 1:
          bikes  glasses  pants  watches
store 1     20      NaN     30       35

How many bikes are in Store 2: 15
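
The last line above looks up a single element by chaining two [] operations (column first, then row). The same element can be read in one step with loc or iloc, which take the row first. A minimal sketch that rebuilds the same store_items data; the variable names for the results are introduced here for illustration:

```python
import pandas as pd

# Same data as store_items above
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5}]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2'])

# loc takes the row label first, then the column label
bikes_store_2 = store_items.loc['store 2', 'bikes']

# iloc does the same by position: second row, first column
first_col_store_2 = store_items.iloc[1, 0]

print(bikes_store_2, first_col_store_2)
```

A single loc call is also the safer choice when assigning a new value to one element, since chained indexing may operate on a temporary copy.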


Adding a new column

In [51]:
# We add a new column named shirts to our store_items DataFrame indicating the number of
# shirts in stock at each store. We will put 15 shirts in store 1 and 2 shirts in store 2
store_items['shirts'] = [15,2]

# We display the modified DataFrame
store_items

         bikes  glasses  pants  watches  shirts
store 1     20      NaN     30       35      15
store 2     15     50.0      5       10       2


Adding new columns using arithmetic operations

In [52]:
# We make a new column called suits by adding the number of shirts and pants
store_items['suits'] = store_items['pants'] + store_items['shirts']

# We display the modified DataFrame
store_items

         bikes  glasses  pants  watches  shirts  suits
store 1     20      NaN     30       35      15     45
store 2     15     50.0      5       10       2      7


DataFrames can also be appended (stacked)

In [53]:
# We create a list with one Python dictionary holding the number of items at the new store
new_items = [{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4}]

# We create a new DataFrame with the new_items and provide an index labeled store 3
new_store = pd.DataFrame(new_items, index = ['store 3'])

# We display the items at the new store
new_store

         bikes  glasses  pants  watches
store 3     20        4     30       35


In [54]:
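# We append store 3 to our store_items DataFrame. DataFrame.append was removed in
# Pandas 2.0, so we use pd.concat; sort = True orders the columns alphabetically
store_items = pd.concat([store_items, new_store], sort = True)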

# We display the modified DataFrame
store_items

         bikes  glasses  pants  shirts  suits  watches
store 1     20      NaN     30    15.0   45.0       35
store 2     15     50.0      5     2.0    7.0       10
store 3     20      4.0     30     NaN    NaN       35


We can create a new column from selected rows of an existing column; rows that are not selected get NaN

In [55]:
# We add a new column using data from particular rows in the watches column
store_items['new watches'] = store_items['watches'][1:]

# We display the modified DataFrame
store_items

         bikes  glasses  pants  shirts  suits  watches  new watches
store 1     20      NaN     30    15.0   45.0       35          NaN
store 2     15     50.0      5     2.0    7.0       10         10.0
store 3     20      4.0     30     NaN    NaN       35         35.0


Using insert to add a column at a specified position

In [56]:
# We insert a new column with label shoes right before the column with numerical index 4
store_items.insert(4, 'shoes', [8,5,0])

# we display the modified DataFrame
store_items

         bikes  glasses  pants  shirts  shoes  suits  watches  new watches
store 1     20      NaN     30    15.0      8   45.0       35          NaN
store 2     15     50.0      5     2.0      5    7.0       10         10.0
store 3     20      4.0     30     NaN      0    NaN       35         35.0


Using pop to remove a column by its label (here, the last column)

In [57]:
# We remove the new watches column
store_items.pop('new watches')

# we display the modified DataFrame
store_items

         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     NaN      0    NaN       35
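
Unlike drop, pop works in place and returns the removed column, so it can be kept for later use. A minimal sketch, assuming a small hypothetical DataFrame named df:

```python
import pandas as pd

# Hypothetical small DataFrame
df = pd.DataFrame({'bikes': [20, 15], 'watches': [35, 10]},
                  index=['store 1', 'store 2'])

# pop removes the column in place and hands it back as a Series
watches = df.pop('watches')

print(list(df.columns))
print(watches)
```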


Using drop to remove columns or rows via the axis keyword. Here, axis = 1 refers to columns and axis = 0 to rows

In [58]:
# We remove the watches and shoes columns
store_items = store_items.drop(['watches', 'shoes'], axis = 1)

# we display the modified DataFrame
store_items

         bikes  glasses  pants  shirts  suits
store 1     20      NaN     30    15.0   45.0
store 2     15     50.0      5     2.0    7.0
store 3     20      4.0     30     NaN    NaN


In [59]:
# We remove the store 2 and store 1 rows
store_items = store_items.drop(['store 2', 'store 1'], axis = 0)

# we display the modified DataFrame
store_items

         bikes  glasses  pants  shirts  suits
store 3     20      4.0     30     NaN    NaN
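
Instead of the axis keyword, drop also accepts the columns and index keywords, which some readers find easier to remember. A minimal sketch, assuming a small hypothetical DataFrame named df:

```python
import pandas as pd

# Hypothetical small DataFrame
df = pd.DataFrame({'bikes': [20, 15], 'pants': [30, 5], 'shoes': [8, 5]},
                  index=['store 1', 'store 2'])

# Equivalent to df.drop(['shoes'], axis = 1)
no_shoes = df.drop(columns=['shoes'])

# Equivalent to df.drop(['store 1'], axis = 0)
only_store_2 = df.drop(index=['store 1'])

print(no_shoes)
print(only_store_2)
```

Both calls return new DataFrames; df itself keeps all its rows and columns.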


Changing column labels using rename with the columns keyword

In [60]:
# We change the column label bikes to hats
store_items = store_items.rename(columns = {'bikes': 'hats'})

# we display the modified DataFrame
store_items

         hats  glasses  pants  shirts  suits
store 3    20      4.0     30     NaN    NaN


We can also use rename to change row labels, with the index keyword

In [62]:
# We change the row label from store 3 to last store
store_items = store_items.rename(index = {'store 3': 'last store'})

# we display the modified DataFrame
store_items

            hats  glasses  pants  shirts  suits
last store    20      4.0     30     NaN    NaN


Use an existing column as the index

In [68]:
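# We change the row index to be the data in the glasses column
# (the recorded output below has no pants column, which suggests pants had already
# been set as the index by an earlier run of this cell)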
store_items = store_items.set_index('glasses')

# we display the modified DataFrame
store_items

         hats  shirts  suits
glasses
4.0        20     NaN    NaN
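
Note that set_index discards the previous row labels. The opposite operation is reset_index, which moves the index back into a regular column. A minimal sketch, assuming a small hypothetical DataFrame named df:

```python
import pandas as pd

# Hypothetical small DataFrame
df = pd.DataFrame({'hats': [20, 15], 'glasses': [4.0, 50.0]},
                  index=['store 3', 'store 2'])

# set_index moves the glasses column into the index and discards the old labels
indexed = df.set_index('glasses')

# reset_index moves glasses back into a regular column with a fresh 0, 1, ... index
restored = indexed.reset_index()

print(indexed)
print(restored)
```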


Dealing with NaN (Not a Number)

In [70]:
# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes':8, 'suits':45},
{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5, 'shirts': 2, 'shoes':5, 'suits':7},
{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes':10}]

# We create a DataFrame  and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])

# We display the DataFrame
store_items

         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     NaN     10    NaN       35


Using the isnull function to identify NaN values: the first sum counts them per column, the second sum adds those counts over the entire DataFrame

In [72]:
# We count the number of NaN values in store_items
x =  store_items.isnull().sum().sum()

# We print x
print('Number of NaN values in our DataFrame:', x)

Number of NaN values in our DataFrame: 3


With a single sum, we see the number of NaN values per column

In [73]:
store_items.isnull().sum()

bikes      0
glasses    1
pants      0
shirts     1
shoes      0
suits      1
watches    0
dtype: int64

Using count to count the non-NaN values

In [74]:
# We print the number of non-NaN values in our DataFrame
print()
print('Number of non-NaN values in the columns of our DataFrame:\n', store_items.count())


Number of non-NaN values in the columns of our DataFrame:
 bikes      3
glasses    2
pants      3
shirts     2
shoes      3
suits      2
watches    3
dtype: int64
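
The counting works because isnull returns a DataFrame of True/False values and sum treats True as 1. The same mask can also be summed per row or used to select the affected rows. A minimal sketch, assuming a small hypothetical DataFrame named df:

```python
import pandas as pd
import numpy as np

# Hypothetical small DataFrame with two missing values
df = pd.DataFrame({'bikes': [20, 15, 20],
                   'glasses': [np.nan, 50.0, 4.0],
                   'shirts': [15.0, 2.0, np.nan]},
                  index=['store 1', 'store 2', 'store 3'])

mask = df.isnull()                    # True where a value is missing
per_column = mask.sum()               # True counts as 1, so this counts NaNs
per_row = mask.sum(axis=1)            # the same count, but per store
rows_with_nan = df[mask.any(axis=1)]  # stores with at least one NaN

print(per_column)
print(per_row)
print(rows_with_nan)
```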


dropna eliminates NaN values; with axis = 0 it drops every row that contains a NaN

In [75]:
# We drop any rows with NaN values
store_items.dropna(axis = 0)

         bikes  glasses  pants  shirts  shoes  suits  watches
store 2     15     50.0      5     2.0      5    7.0       10


The same with dropna, but with axis = 1 it drops every column that contains a NaN

In [76]:
# We drop any columns with NaN values
store_items.dropna(axis = 1)

         bikes  pants  shoes  watches
store 1     20     30      8       35
store 2     15      5      5       10
store 3     20     30     10       35
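
Dropping every row or column that has any NaN can discard a lot of data. dropna has keywords to make the rule less strict. A minimal sketch, assuming a small hypothetical DataFrame named df:

```python
import pandas as pd
import numpy as np

# Hypothetical small DataFrame: store 2 has no data at all
df = pd.DataFrame({'bikes': [20, np.nan, 20],
                   'glasses': [4.0, np.nan, np.nan],
                   'shirts': [np.nan, np.nan, np.nan]},
                  index=['store 1', 'store 2', 'store 3'])

all_missing_dropped = df.dropna(how='all')   # drop rows where every value is NaN
at_least_two = df.dropna(thresh=2)           # keep rows with at least 2 non-NaN values
bikes_known = df.dropna(subset=['bikes'])    # only the bikes column decides

print(all_missing_dropped)
print(at_least_two)
print(bikes_known)
```

As with the calls above, each of these returns a new DataFrame and leaves df unchanged.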


Filling NaNs with zeros using the fillna function

In [77]:
# We replace all NaN values with 0
store_items.fillna(0)

         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      0.0     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     0.0     10    0.0       35


Replace NaNs with the previous value using forward filling (ffill). With axis = 0 the fill runs down each column, so a NaN in the first row has nothing to copy from and stays NaN

In [80]:
print(store_items)
# We replace NaN values with the previous value in the column
# (ffill replaces fillna(method = 'ffill'), which is deprecated in recent Pandas)
store_items.ffill(axis = 0)

         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     NaN     10    NaN       35


         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     2.0     10    7.0       35


Replace NaNs with the previous value using forward filling (ffill). With axis = 1 the fill runs along each row

In [83]:
print(store_items)
# We replace NaN values with the previous value in the row
# (ffill replaces fillna(method = 'ffill'), which is deprecated in recent Pandas)
store_items.ffill(axis = 1)

         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     NaN     10    NaN       35


         bikes  glasses  pants  shirts  shoes  suits  watches
store 1   20.0     20.0   30.0    15.0    8.0   45.0     35.0
store 2   15.0     50.0    5.0     2.0    5.0    7.0     10.0
store 3   20.0      4.0   30.0    30.0   10.0   10.0     35.0


Backward filling down each column (axis = 0)

In [84]:
print(store_items)
# We replace NaN values with the next value in the column
# (bfill replaces fillna(method = 'backfill'), which is deprecated in recent Pandas)
store_items.bfill(axis = 0)

         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     NaN     10    NaN       35


         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20     50.0     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     NaN     10    NaN       35


Backward filling along each row (axis = 1)

In [85]:
print(store_items)
# We replace NaN values with the next value in the row
# (bfill replaces fillna(method = 'backfill'), which is deprecated in recent Pandas)
store_items.bfill(axis = 1)

         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     NaN     10    NaN       35


         bikes  glasses  pants  shirts  shoes  suits  watches
store 1   20.0     30.0   30.0    15.0    8.0   45.0     35.0
store 2   15.0     50.0    5.0     2.0    5.0    7.0     10.0
store 3   20.0      4.0   30.0    10.0   10.0   35.0     35.0
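
Forward and backward filling copy a value across the whole gap, however long it is. The limit keyword caps how many consecutive NaNs are filled. A minimal sketch, assuming a hypothetical Series named s:

```python
import pandas as pd
import numpy as np

# Hypothetical Series with a run of three missing values
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

filled_all = s.ffill()           # every gap takes the last seen value
filled_one = s.ffill(limit=1)    # at most one consecutive NaN is filled

print(filled_all.tolist())
print(filled_one.tolist())
```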


We can also fill NaNs by linear interpolation. Here it runs down each column (axis = 0)

In [89]:
print(store_items)
# We replace NaN values by using linear interpolation using column values
store_items.interpolate(method = 'linear', axis = 0)

         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     NaN     10    NaN       35


         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     2.0     10    7.0       35


The same linear interpolation, this time along each row (axis = 1)

In [91]:
print(store_items)
# We replace NaN values by using linear interpolation using row values
store_items.interpolate(method = 'linear', axis = 1)

         bikes  glasses  pants  shirts  shoes  suits  watches
store 1     20      NaN     30    15.0      8   45.0       35
store 2     15     50.0      5     2.0      5    7.0       10
store 3     20      4.0     30     NaN     10    NaN       35


         bikes  glasses  pants  shirts  shoes  suits  watches
store 1   20.0     25.0   30.0    15.0    8.0   45.0     35.0
store 2   15.0     50.0    5.0     2.0    5.0    7.0     10.0
store 3   20.0      4.0   30.0    20.0   10.0   22.5     35.0
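
In the row-wise result above, each filled value is the midpoint of its two neighbours, for example 25.0 between 20 and 30. With a longer gap, linear interpolation spaces the missing values evenly. A minimal sketch, assuming a hypothetical Series named s:

```python
import pandas as pd
import numpy as np

# Hypothetical Series with two missing values between 10 and 40
s = pd.Series([10.0, np.nan, np.nan, 40.0])

# Linear interpolation spaces the gap evenly: the step is (40 - 10) / 3 = 10
filled = s.interpolate(method='linear')

print(filled.tolist())
```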


In [95]:
#Question:
import pandas as pd
import numpy as np

# Since we will be working with ratings, we will set the precision of our 
# dataframes to one decimal place.
pd.set_option('display.precision', 1)

# Create a Pandas DataFrame that contains the ratings some users have given to a
# series of books. The ratings given are in the range from 1 to 5, with 5 being
# the best score. The names of the books, the authors, and the ratings of each user
# are given below:

books = pd.Series(data = ['Great Expectations', 'Of Mice and Men', 'Romeo and Juliet', 'The Time Machine', 'Alice in Wonderland' ])
authors = pd.Series(data = ['Charles Dickens', 'John Steinbeck', 'William Shakespeare', ' H. G. Wells', 'Lewis Carroll' ])

user_1 = pd.Series(data = [3.2, np.nan ,2.5])
user_2 = pd.Series(data = [5., 1.3, 4.0, 3.8])
user_3 = pd.Series(data = [2.0, 2.3, np.nan, 4])
user_4 = pd.Series(data = [4, 3.5, 4, 5, 4.2])

# Users that have np.nan values means that the user has not yet rated that book.
# Use the data above to create a Pandas DataFrame that has the following column
# labels: 'Author', 'Book Title', 'User 1', 'User 2', 'User 3', 'User 4'. Let Pandas
# automatically assign numerical row indices to the DataFrame. 

# Create a dictionary with the data given above
dat = {'Author': authors,
       'Book Title': books, 
       'User 1': user_1, 
       'User 2': user_2, 
       'User 3': user_3, 
       'User 4': user_4}

# Use the dictionary to create a Pandas DataFrame
book_ratings = pd.DataFrame(data = dat)

# If you created the dictionary correctly you should have a Pandas DataFrame
# that has column labels: 'Author', 'Book Title', 'User 1', 'User 2', 'User 3',
# 'User 4' and row indices 0 through 4.
print(book_ratings)

# Now replace all the NaN values in your DataFrame with the average rating in
# each column. Replace the NaN values in place. HINT: you can use the fillna()
# function with the keyword inplace = True, to do this. Write your code below:
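# numeric_only = True restricts the mean to the rating columns; recent Pandas
# versions raise an error when asked to average the text columns
print(book_ratings.fillna(book_ratings.mean(numeric_only = True)))

book_ratings.fillna(book_ratings.mean(numeric_only = True), inplace = True)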

                Author           Book Title  User 1  User 2  User 3  User 4
0      Charles Dickens   Great Expectations     3.2     5.0     2.0     4.0
1       John Steinbeck      Of Mice and Men     NaN     1.3     2.3     3.5
2  William Shakespeare     Romeo and Juliet     2.5     4.0     NaN     4.0
3          H. G. Wells     The Time Machine     NaN     3.8     4.0     5.0
4        Lewis Carroll  Alice in Wonderland     NaN     NaN     NaN     4.2
                Author           Book Title  User 1  User 2  User 3  User 4
0      Charles Dickens   Great Expectations     3.2     5.0     2.0     4.0
1       John Steinbeck      Of Mice and Men     2.9     1.3     2.3     3.5
2  William Shakespeare     Romeo and Juliet     2.5     4.0     2.8     4.0
3          H. G. Wells     The Time Machine     2.9     3.8     4.0     5.0
4        Lewis Carroll  Alice in Wonderland     2.9     3.5     2.8     4.2


In [None]:
#Answer
import pandas as pd
import numpy as np

pd.set_option('display.precision', 1)

books = pd.Series(data = ['Great Expectations', 'Of Mice and Men', 'Romeo and Juliet', 'The Time Machine', 'Alice in Wonderland' ])
authors = pd.Series(data = ['Charles Dickens', 'John Steinbeck', 'William Shakespeare', ' H. G. Wells', 'Lewis Carroll' ])
user_1 = pd.Series(data = [3.2, np.nan ,2.5])
user_2 = pd.Series(data = [5., 1.3, 4.0, 3.8])
user_3 = pd.Series(data = [2.0, 2.3, np.nan, 4])
user_4 = pd.Series(data = [4, 3.5, 4, 5, 4.2])

dat = {'Book Title' : books,
       'Author' : authors,
       'User 1' : user_1,
       'User 2' : user_2,
       'User 3' : user_3,
       'User 4' : user_4}

book_ratings = pd.DataFrame(dat)

book_ratings.fillna(book_ratings.mean(numeric_only = True), inplace = True)


Loading Data into a Pandas DataFrame

In [None]:
# We load Google stock data in a DataFrame
Google_stock = pd.read_csv('./GOOG.csv')

# We print some information about Google_stock
print('Google_stock is of type:', type(Google_stock))
print('Google_stock has shape:', Google_stock.shape)
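
Since GOOG.csv is not included here, the following minimal sketch shows the same read_csv call on a few lines of made-up CSV text. The column names and values are hypothetical, and the parse_dates and index_col keywords are optional extras that turn a date column into the row index:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk; the column names and
# values are made up for illustration
csv_text = """Date,Open,Close
2020-01-02,10.0,10.5
2020-01-03,10.5,10.2
2020-01-06,10.2,11.0
"""

# read_csv accepts a path or any file-like object
stock = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'], index_col='Date')

print(stock.shape)
print(stock.index.dtype)
print(stock.head(2))
```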

In [None]:
Google_stock

In [None]:
Google_stock.head()

In [None]:
Google_stock.tail()

In [None]:
Google_stock.isnull().any()

In [None]:
# We get descriptive statistics on our stock data
Google_stock.describe()

In [None]:
# We get descriptive statistics on a single column of our DataFrame
Google_stock['Adj Close'].describe()

In [None]:
# We print information about our DataFrame  
print()
print('Maximum values of each column:\n', Google_stock.max())
print()
print('Minimum Close value:', Google_stock['Close'].min())
print()
print('Average value of each numeric column:\n', Google_stock.mean(numeric_only = True))

In [None]:
# We display the correlation between the numeric columns
Google_stock.corr(numeric_only = True)

In [None]:
# We load fake Company data in a DataFrame
data = pd.read_csv('./fake_company.csv')

data

In [None]:
# We display the total amount of money spent in salaries each year
data.groupby(['Year'])['Salary'].sum()

In [None]:
# We display the average salary per year
data.groupby(['Year'])['Salary'].mean()

In [None]:
# We display the total salary each employee received in all the years they worked for the company
data.groupby(['Name'])['Salary'].sum()

In [None]:
# We display the total salary per department per year.
data.groupby(['Year', 'Department'])['Salary'].sum()
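
Since fake_company.csv is not included here, the following minimal sketch builds a tiny stand-in DataFrame to show what the groupby calls above return. Only the column names Year, Name, Department and Salary come from the cells above; the rows are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for fake_company.csv; only the column names are taken
# from the cells above, the values are made up
data = pd.DataFrame({
    'Year': [1990, 1990, 1991, 1991],
    'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
    'Department': ['HR', 'RD', 'HR', 'RD'],
    'Salary': [50000, 48000, 52000, 50000],
})

# Total and average salary per year in one call
per_year = data.groupby('Year')['Salary'].agg(['sum', 'mean'])

# Grouping by two columns gives a Series with a two-level index
per_year_dept = data.groupby(['Year', 'Department'])['Salary'].sum()

print(per_year)
print(per_year_dept)
```

Using agg with a list of function names computes several statistics at once, which avoids repeating the groupby for sum and mean separately.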