<a href="https://colab.research.google.com/github/abhishekunivai/WSCourseDen/blob/main/Python-For-EDA/L2_Data_Transformation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this exercise, we will learn how to combine dataframes: concatenating them along an axis, merging them on columns or on the index, combining data with overlaps, and reshaping and pivoting.

We will also study various data cleaning techniques, including removing duplicates, replacing values, renaming axis indexes, discretization and binning, and detecting and filtering outliers. We will work on transforming data using a function or mapping, permutation and random sampling, and computing indicator/dummy variables.

In [4]:
import pandas as pd
import numpy as np

# Combining dataframes



In [2]:
"""
In the dataset below, the first column contains the student identifier and
the second column contains the student's score in a subject.
Both dataframes have the same structure, so we can concatenate them.
""" 
dataFrame1 =  pd.DataFrame({'StudentID': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29], 'Score' : [89, 39, 50, 97, 22, 66, 31, 51, 71, 91, 56, 32, 52, 73, 92]})
dataFrame2 =  pd.DataFrame({'StudentID': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30], 'Score': [98, 93, 44, 77, 69, 56, 31, 53, 78, 93, 56, 77, 33, 56, 27]})



In [6]:
# We can do that with the pandas concat() function.

dataframe = pd.concat([dataFrame1, dataFrame2], ignore_index=True)
dataframe.head(10)

Unnamed: 0,StudentID,Score
0,1,89
1,3,39
2,5,50
3,7,97
4,9,22
5,11,66
6,13,31
7,15,51
8,17,71
9,19,91


The argument ignore_index=True creates a new index; omitting it keeps the original indices.
Note that we combined the dataframes along axis=0, which stacks them one below the other. What if we want to combine both dataframes side by side?

In [7]:
# Hint: Try to change value of "axis" argument
#Insert Your Code Here

#Solution: pd.concat([dataFrame1, dataFrame2], axis=1)
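
To make the effect of ignore_index and axis concrete, here is a minimal sketch on two tiny frames. The names left and right and their values are illustrative assumptions, not part of the exercise data.

```python
import pandas as pd

# Two small frames with identical columns (illustrative data)
left = pd.DataFrame({'StudentID': [1, 3], 'Score': [89, 39]})
right = pd.DataFrame({'StudentID': [2, 4], 'Score': [98, 93]})

stacked_keep = pd.concat([left, right])                    # original indices kept: 0, 1, 0, 1
stacked_new = pd.concat([left, right], ignore_index=True)  # fresh index: 0, 1, 2, 3
side_by_side = pd.concat([left, right], axis=1)            # frames placed next to each other

print(list(stacked_keep.index))
print(list(stacked_new.index))
print(side_by_side.shape)
```

With axis=1 the rows are aligned on the index, so the result has two rows and four columns, two of which share the name StudentID.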

# Merging

Let's consider a case where we have two subjects, and for each subject we have two datasets.

In [8]:
df1SE =  pd.DataFrame({'StudentID': [9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29], 'ScoreSE' : [22, 66, 31, 51, 71, 91, 56, 32, 52, 73, 92]})
df2SE =  pd.DataFrame({'StudentID': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30], 'ScoreSE': [98, 93, 44, 77, 69, 56, 31, 53, 78, 93, 56, 77, 33, 56, 27]})

df1ML =  pd.DataFrame({'StudentID': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29], 'ScoreML' : [39, 49, 55, 77, 52, 86, 41, 77, 73, 51, 86, 82, 92, 23, 49]})
df2ML =  pd.DataFrame({'StudentID': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20], 'ScoreML': [93, 44, 78, 97, 87, 89, 39, 43, 88, 78]})

As you can see in the dataset above, there are two dataframes for each subject.

We need to combine SE and ML into one dataframe.
There are multiple ways to complete the task. Let's go over some of them.

In [None]:
# Option 1 - Concatenate df1SE & df2SE and df1ML & df2ML, then concatenate the two results side by side.
# Rows are aligned by position, so student IDs are not matched across subjects.


"""
Solution
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)

df = pd.concat([dfML, dfSE], axis=1)
"""

Unnamed: 0,StudentID,ScoreML,StudentID.1,ScoreSE
0,1.0,39.0,9,22
1,3.0,49.0,11,66
2,5.0,55.0,13,31
3,7.0,77.0,15,51
4,9.0,52.0,17,71
5,11.0,86.0,19,91
6,13.0,41.0,21,56
7,15.0,77.0,23,32
8,17.0,73.0,25,52
9,19.0,51.0,27,73


In [None]:
# Option 2
"""
Here, you will perform an inner join of the two dataframes.
That is, a row is included in the new dataframe only if its StudentID exists in both dataframes.
This gives us the students who appear in both courses.

Hint: Apply Inner Join using merge()
""" 

"""
Solution
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)
df = dfSE.merge(dfML, how='inner')
"""



Unnamed: 0,StudentID,ScoreSE,ScoreML
0,9,22,52
1,11,66,86
2,13,31,41
3,15,51,77
4,17,71,73
5,19,91,51
6,21,56,86
7,23,32,82
8,25,52,92
9,27,73,23


In [None]:
# Option 3
"""
Here, you will perform a left join of the two dataframes.
That is, every row of the first (left) dataframe is included, whether or not it has a match in the second; unmatched scores become NaN.

Hint: Apply Left Join using merge()
""" 

"""
Solution
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)

df = dfSE.merge(dfML, how='left')
"""

Unnamed: 0,StudentID,ScoreSE,ScoreML
0,9,22,52.0
1,11,66,86.0
2,13,31,41.0
3,15,51,77.0
4,17,71,73.0
5,19,91,51.0
6,21,56,86.0
7,23,32,82.0
8,25,52,92.0
9,27,73,23.0


In [None]:
# Option 4
"""
Here, you will perform a right join of the two dataframes.
That is, every row of the second (right) dataframe is included, whether or not it has a match in the first; unmatched scores become NaN.

Hint: Apply Right Join using merge()
""" 

"""
Solution
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)

df = dfSE.merge(dfML, how='right')
"""

Unnamed: 0,StudentID,ScoreSE,ScoreML
0,1,,39
1,3,,49
2,5,,55
3,7,,77
4,9,22.0,52
5,11,66.0,86
6,13,31.0,41
7,15,51.0,77
8,17,71.0,73
9,19,91.0,51


In [None]:
# Option 5
"""
Here, you will perform an outer join of the two dataframes.
That is, a row is included in the new dataframe if its StudentID exists in either dataframe.

Hint: Apply Outer Join using merge()
""" 

"""
Solution
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)

df = dfSE.merge(dfML, how='outer')
"""

Unnamed: 0,StudentID,ScoreSE,ScoreML
0,9,22.0,52.0
1,11,66.0,86.0
2,13,31.0,41.0
3,15,51.0,77.0
4,17,71.0,73.0
5,19,91.0,51.0
6,21,56.0,86.0
7,23,32.0,82.0
8,25,52.0,92.0
9,27,73.0,23.0
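
The four join types can be compared side by side on a small example. This is a sketch under assumptions: the frames se and ml are made-up miniatures of dfSE and dfML, the join key is passed explicitly with on='StudentID', and indicator=True is used to show which side each row came from.

```python
import pandas as pd

# Miniature stand-ins for dfSE and dfML (illustrative data)
se = pd.DataFrame({'StudentID': [9, 11, 2], 'ScoreSE': [22, 66, 98]})
ml = pd.DataFrame({'StudentID': [1, 9, 11], 'ScoreML': [39, 52, 86]})

inner = se.merge(ml, on='StudentID', how='inner')   # IDs present in both
left = se.merge(ml, on='StudentID', how='left')     # every row of se
right = se.merge(ml, on='StudentID', how='right')   # every row of ml
# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
outer = se.merge(ml, on='StudentID', how='outer', indicator=True)

print(len(inner), len(left), len(right), len(outer))
print(outer[['StudentID', '_merge']])
```

Naming the key with on= is optional here because StudentID is the only shared column, but it guards against accidental joins on other columns that happen to share a name.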


In [None]:
df = pd.read_csv("https://drive.google.com/uc?id=1fh7UdeQO0p_yWUAKcCinv9dElxaWQ5Ya")
df.head(10)

Unnamed: 0.1,Unnamed: 0,Account,Company,Order,SKU,Country,Year,Quantity,UnitPrice,transactionComplete
0,0,123456779,Kulas Inc,99985,s9-supercomputer,Aruba,1981,5148,545,False
1,1,123456784,GitHub,99986,s4-supercomputer,Brazil,2001,3262,383,False
2,2,123456782,Kulas Inc,99990,s10-supercomputer,Montserrat,1973,9119,407,True
3,3,123456783,My SQ Man,99999,s1-supercomputer,El Salvador,2015,3097,615,False
4,4,123456787,ABC Dogma,99996,s6-supercomputer,Poland,1970,3356,91,True
5,5,123456778,Super Sexy Dingo,99996,s9-supercomputer,Costa Rica,2004,2474,136,True
6,6,123456783,ABC Dogma,99981,s11-supercomputer,Spain,2006,4081,195,False
7,7,123456785,ABC Dogma,99998,s9-supercomputer,Belarus,2015,6576,603,False
8,8,123456778,Loolo INC,99997,s8-supercomputer,Mauritius,1999,2460,36,False
9,9,123456775,Kulas Inc,99997,s7-supercomputer,French Guiana,2004,1831,664,True


In [None]:
#Task: Add a new column holding the total price, i.e. the product of the quantity and the unit price

"""
Solution
df['TotalPrice'] = df['UnitPrice'] * df['Quantity']
df.head(10)
"""

Unnamed: 0,Account,Company,Order,SKU,Country,Year,Quantity,UnitPrice,transactionComplete,TotalPrice
0,123456779,Kulas Inc,99985,s9-supercomputer,Aruba,1981,5148,545,False,2805660
1,123456784,GitHub,99986,s4-supercomputer,Brazil,2001,3262,383,False,1249346
2,123456782,Kulas Inc,99990,s10-supercomputer,Montserrat,1973,9119,407,True,3711433
3,123456783,My SQ Man,99999,s1-supercomputer,El Salvador,2015,3097,615,False,1904655
4,123456787,ABC Dogma,99996,s6-supercomputer,Poland,1970,3356,91,True,305396
5,123456778,Super Sexy Dingo,99996,s9-supercomputer,Costa Rica,2004,2474,136,True,336464
6,123456783,ABC Dogma,99981,s11-supercomputer,Spain,2006,4081,195,False,795795
7,123456785,ABC Dogma,99998,s9-supercomputer,Belarus,2015,6576,603,False,3965328
8,123456778,Loolo INC,99997,s8-supercomputer,Mauritius,1999,2460,36,False,88560
9,123456775,Kulas Inc,99997,s7-supercomputer,French Guiana,2004,1831,664,True,1215784


In [None]:
df['Company'].value_counts()

My SQ Man                   869
Kirlosker Service Center    863
Will LLC                    862
ABC Dogma                   848
Kulas Inc                   840
Gen Power                   836
Name IT                     836
Super Sexy Dingo            828
GitHub                      823
Loolo INC                   822
SAS Web Tec                 798
Pryianka Ji                 775
Name: Company, dtype: int64

In [None]:
df.describe()

Unnamed: 0,Account,Order,Year,Quantity,UnitPrice,TotalPrice
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,123456800.0,99989.5629,1994.6198,4985.4473,355.8666,1773301.0
std,5.741156,5.905551,14.432771,2868.949686,201.378478,1540646.0
min,123456800.0,99980.0,1970.0,0.0,10.0,0.0
25%,123456800.0,99985.0,1982.0,2505.75,181.0,500337.0
50%,123456800.0,99990.0,1995.0,4994.0,356.0,1335698.0
75%,123456800.0,99995.0,2007.0,7451.5,531.0,2711653.0
max,123456800.0,99999.0,2019.0,9999.0,700.0,6841580.0


## Reshaping with Hierarchical Indexing

In [9]:
data = np.arange(15).reshape((3,5))
indexers = ['Rainfall', 'Humidity', 'Wind']
dframe1 = pd.DataFrame(data, index=indexers, columns=['Lucknow', 'Delhi', 'Bengaluru', 'Chennai', 'Kolkata'])
dframe1

Unnamed: 0,Lucknow,Delhi,Bengaluru,Chennai,Kolkata
Rainfall,0,1,2,3,4
Humidity,5,6,7,8,9
Wind,10,11,12,13,14


In [10]:
stacked = dframe1.stack()
stacked

Rainfall  Lucknow       0
          Delhi         1
          Bengaluru     2
          Chennai       3
          Kolkata       4
Humidity  Lucknow       5
          Delhi         6
          Bengaluru     7
          Chennai       8
          Kolkata       9
Wind      Lucknow      10
          Delhi        11
          Bengaluru    12
          Chennai      13
          Kolkata      14
dtype: int64

In [11]:
stacked.unstack()

Unnamed: 0,Lucknow,Delhi,Bengaluru,Chennai,Kolkata
Rainfall,0,1,2,3,4
Humidity,5,6,7,8,9
Wind,10,11,12,13,14


In [None]:
series1 = pd.Series([000, 111, 222, 333], index=['zeros','ones', 'twos', 'threes'])
series2 = pd.Series([444, 555, 666], index=['fours', 'fives', 'sixs'])

frame2 = pd.concat([series1, series2], keys=['Number1', 'Number2'])
print(frame2)

"""
How would frame2 look if we try to unstack it?
Try to use unstack() on frame2 
"""

"""
Solution
frame2.unstack()
"""

Number1  zeros       0
         ones      111
         twos      222
         threes    333
Number2  fours     444
         fives     555
         sixs      666
dtype: int64


Unnamed: 0,fives,fours,ones,sixs,threes,twos,zeros
Number1,,,111.0,,333.0,222.0,0.0
Number2,555.0,444.0,,666.0,,,
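
By default unstack() pivots the innermost index level into columns; passing level=0 pivots the outer level instead. A small sketch, where the frame and its values are illustrative assumptions:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape(2, 3),
                     index=['Rainfall', 'Humidity'],
                     columns=['Lucknow', 'Delhi', 'Chennai'])

stacked = frame.stack()              # Series indexed by (measure, city)
round_trip = stacked.unstack()       # innermost level (city) becomes the columns
by_city = stacked.unstack(level=0)   # outer level (measure) becomes the columns

print(by_city)
```

Here by_city has one row per city and one column per measure, which is the transpose of the original frame.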


# Data deduplication

In [22]:
frame3 = pd.DataFrame({'column 1': ['Looping'] * 3 + ['Functions'] * 4, 'column 2': [10, 10, 22, 23, 23, 24, 24]})
frame3

Unnamed: 0,column 1,column 2
0,Looping,10
1,Looping,10
2,Looping,22
3,Functions,23
4,Functions,23
5,Functions,24
6,Functions,24


In [23]:
"""
Exercise: Check for duplicates in frame3 without dropping them.
"""

"""
Solution
frame3.duplicated()
"""


In [24]:
frame4 = frame3.drop_duplicates()
frame4

Unnamed: 0,column 1,column 2
0,Looping,10
2,Looping,22
3,Functions,23
5,Functions,24


In [25]:
frame3['column 3'] = range(7)
# Let's see the result if we try to drop duplicates now:
# 'column 3' is unique in every row, so no row is a full duplicate and nothing is dropped
frame5 = frame3.drop_duplicates()
frame5

Unnamed: 0,column 1,column 2,column 3
0,Looping,10,0
1,Looping,10,1
2,Looping,22,2
3,Functions,23,3
4,Functions,23,4
5,Functions,24,5
6,Functions,24,6


In [26]:
# We can drop rows by passing in column to be checked for duplicate values
frame3['column 3'] = range(7)
frame5 = frame3.drop_duplicates(['column 2'])
frame5

Unnamed: 0,column 1,column 2,column 3
0,Looping,10,0
2,Looping,22,2
3,Functions,23,3
5,Functions,24,5


In [27]:
"""
Exercise: Drop every row whose 'column 2' value is duplicated in frame3, using a bool series
"""
frame3['column 3'] = range(7)

"""
Solution:
bool_series = frame3["column 2"].duplicated(keep = False)
print(bool_series)
frame5 = frame3[~bool_series]
print(frame5)
"""

0     True
1     True
2    False
3     True
4     True
5     True
6     True
Name: column 2, dtype: bool
  column 1  column 2  column 3
2  Looping        22         2


In [30]:
"""
Exercise: Using a bool series, keep only the last occurrence of each
          duplicated value in frame3 and drop the earlier occurrences
"""
frame3['column 3'] = range(7)

"""
Solution
bool_series = frame3["column 2"].duplicated(keep ='last')
print(bool_series)
frame5 = frame3[~bool_series]
print(frame5)
"""

0     True
1    False
2    False
3     True
4    False
5     True
6    False
Name: column 2, dtype: bool
    column 1  column 2  column 3
1    Looping        10         1
2    Looping        22         2
4  Functions        23         4
6  Functions        24         6
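
The three values of the keep argument are easiest to compare on one short Series. A minimal sketch, with made-up values:

```python
import pandas as pd

values = pd.Series([10, 10, 22, 23, 23])

# keep='first': every occurrence after the first is flagged
first = values.duplicated(keep='first').tolist()
# keep='last': every occurrence before the last is flagged
last = values.duplicated(keep='last').tolist()
# keep=False: all occurrences of a repeated value are flagged
every = values.duplicated(keep=False).tolist()

print(first)
print(last)
print(every)
```

Negating any of these masks with ~ and indexing the Series or dataframe with it keeps the rows that were not flagged.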


# Replacing values

In [None]:
import numpy as np


In [None]:
replaceFrame = pd.DataFrame({'column 1': [200., 3000., -786., 3000., 234., 444., -786., 332., 3332. ], 'column 2': range(9)})
replaceFrame.replace(to_replace =-786, value= np.nan)


Unnamed: 0,column 1,column 2
0,200.0,0
1,3000.0,1
2,,2
3,3000.0,3
4,234.0,4
5,444.0,5
6,,6
7,332.0,7
8,3332.0,8


In [None]:
replaceFrame = pd.DataFrame({'column 1': [200., 3000., -786., 3000., 234., 444., -786., 332., 3332. ], 'column 2': range(9)})
"""
Exercise: Replace -786 with NaN & 0 with 2
"""

"""
Solution
replaceFrame.replace(to_replace =[-786, 0], value= [np.nan, 2])
"""

Unnamed: 0,column 1,column 2
0,200.0,2
1,3000.0,1
2,,2
3,3000.0,3
4,234.0,4
5,444.0,5
6,,6
7,332.0,7
8,3332.0,8
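
As an alternative to two parallel lists, replace() also accepts a dict. The sketch below assumes a shortened version of replaceFrame; a flat dict applies to the whole frame, while a nested dict limits the replacement to one column.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'column 1': [200., -786., 3000.], 'column 2': [0, 1, 2]})

# Flat dict: old value -> new value, applied to every column
mapped = frame.replace({-786: np.nan, 0: 2})

# Nested dict: {column name: {old value: new value}}
only_col2 = frame.replace({'column 2': {0: 2}})

print(mapped)
print(only_col2)
```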


# Handling missing data

In [5]:
data = np.arange(15, 30).reshape(5, 3)
df_store = pd.DataFrame(data, index=['apple', 'banana', 'kiwi', 'grapes', 'mango'], columns=['store1', 'store2', 'store3'])
df_store

Unnamed: 0,store1,store2,store3
apple,15,16,17
banana,18,19,20
kiwi,21,22,23
grapes,24,25,26
mango,27,28,29


In [6]:
df_store['store4'] = np.nan
df_store.loc['watermelon'] = np.arange(15, 19)
df_store.loc['oranges'] = np.nan
df_store['store5'] = np.nan
# Use a single .loc call; chained indexing (df['store4']['apple'] = ...) may assign to a copy
df_store.loc['apple', 'store4'] = 20.
df_store

Unnamed: 0,store1,store2,store3,store4,store5
apple,15.0,16.0,17.0,20.0,
banana,18.0,19.0,20.0,,
kiwi,21.0,22.0,23.0,,
grapes,24.0,25.0,26.0,,
mango,27.0,28.0,29.0,,
watermelon,15.0,16.0,17.0,18.0,
oranges,,,,,


In [None]:
df_store.isnull()

Unnamed: 0,store1,store2,store3,store4,store5
apple,False,False,False,False,True
banana,False,False,False,True,True
kiwi,False,False,False,True,True
grapes,False,False,False,True,True
mango,False,False,False,True,True
watermelon,False,False,False,False,True
oranges,True,True,True,True,True


In [None]:
"""
Exercise: Check for non null values in df_store
"""

"""
Solution
df_store.notnull()
"""

Unnamed: 0,store1,store2,store3,store4,store5
apple,True,True,True,True,False
banana,True,True,True,False,False
kiwi,True,True,True,False,False
grapes,True,True,True,False,False
mango,True,True,True,False,False
watermelon,True,True,True,True,False
oranges,False,False,False,False,False


In [None]:
df_store.isnull().sum()

store1    1
store2    1
store3    1
store4    5
store5    7
dtype: int64

In [None]:
"""
Exercise: Return sum of ALL null values in df_store
"""

"""
Solution
df_store.isnull().sum().sum()
"""

15

In [None]:
df_store.count()

store1    6
store2    6
store3    6
store4    2
store5    0
dtype: int64

In [None]:
"""
Exercise: Return ALL non null values in "store4" of df_store
"""

"""
Solution
df_store.store4[df_store.store4.notnull()]
"""

apple         20.0
watermelon    18.0
Name: store4, dtype: float64

In [None]:
df_store.store4.dropna()


apple         20.0
watermelon    18.0
Name: store4, dtype: float64

In [None]:
df_store.dropna()

Unnamed: 0,store1,store2,store3,store4,store5


In [None]:
"""
Exercise: Drop rows with ALL null values in df_store
"""

"""
Solution
df_store.dropna(how='all')
"""

Unnamed: 0,store1,store2,store3,store4,store5
apple,15.0,16.0,17.0,20.0,
banana,18.0,19.0,20.0,,
kiwi,21.0,22.0,23.0,,
grapes,24.0,25.0,26.0,,
mango,27.0,28.0,29.0,,
watermelon,15.0,16.0,17.0,18.0,


In [None]:
"""
Exercise: Drop columns with ALL null values in df_store
"""

"""
Solution
df_store.dropna(how='all', axis=1)
"""

Unnamed: 0,store1,store2,store3,store4
apple,15.0,16.0,17.0,20.0
banana,18.0,19.0,20.0,
kiwi,21.0,22.0,23.0,
grapes,24.0,25.0,26.0,
mango,27.0,28.0,29.0,
watermelon,15.0,16.0,17.0,18.0
oranges,,,,


In [None]:
df_store2 = df_store.copy()
# Assign with a single .loc call so the values are written to df_store2 itself
df_store2.loc['oranges', 'store1'] = 0
df_store2.loc['oranges', 'store3'] = 0
df_store2

Unnamed: 0,store1,store2,store3,store4,store5
apple,15.0,16.0,17.0,20.0,
banana,18.0,19.0,20.0,,
kiwi,21.0,22.0,23.0,,
grapes,24.0,25.0,26.0,,
mango,27.0,28.0,29.0,,
watermelon,15.0,16.0,17.0,18.0,
oranges,0.0,,0.0,,


In [None]:
df_store2.dropna(how='any', axis=1)

Unnamed: 0,store1,store3
apple,15.0,17.0
banana,18.0,20.0
kiwi,21.0,23.0
grapes,24.0,26.0
mango,27.0,29.0
watermelon,15.0,17.0
oranges,0.0,0.0


In [None]:
"""
Exercise: Drop columns that have fewer than 5 non-null values (thresh counts non-null values)
"""

"""
Solution
df_store.dropna(thresh=5, axis=1)
"""

Unnamed: 0,store1,store2,store3
apple,15.0,16.0,17.0
banana,18.0,19.0,20.0
kiwi,21.0,22.0,23.0
grapes,24.0,25.0,26.0
mango,27.0,28.0,29.0
watermelon,15.0,16.0,17.0
oranges,,,
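
Since thresh is easy to misread, here is a small sketch showing that it is a minimum count of non-null values, not of nulls. The frame and its column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'full': [1.0, 2.0, 3.0, 4.0],
    'half': [1.0, np.nan, 3.0, np.nan],
    'empty': [np.nan] * 4,
})

non_null = frame.notnull().sum()        # non-null values per column
kept = frame.dropna(axis=1, thresh=3)   # keep columns with at least 3 non-null values

print(non_null)
print(list(kept.columns))
```

Only the column full reaches three non-null values, so it is the only one kept.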


## NaN values in mathematical operations

In [None]:
ar1 = np.array([100, 200, np.nan, 300])
ser1 = pd.Series(ar1)

ar1.mean(), ser1.mean()

(nan, 200.0)

In [None]:
ser2 = df_store.store4
ser2.sum()

38.0

In [None]:
ser2.mean()

19.0

In [None]:
ser2.cumsum()

apple         20.0
banana         NaN
kiwi           NaN
grapes         NaN
mango          NaN
watermelon    38.0
oranges        NaN
Name: store4, dtype: float64

In [None]:
"""
Exercise: Add 1 to all non null values of store4 in df_store
"""

"""
Solution
df_store.store4 + 1
"""

apple         21.0
banana         NaN
kiwi           NaN
grapes         NaN
mango          NaN
watermelon    19.0
oranges        NaN
Name: store4, dtype: float64
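
The difference between NumPy and pandas seen above comes from how each treats NaN by default. A short sketch of the four combinations, reusing the same values as ar1:

```python
import numpy as np
import pandas as pd

arr = np.array([100, 200, np.nan, 300])
ser = pd.Series(arr)

numpy_mean = arr.mean()                  # NaN propagates
nan_aware = np.nanmean(arr)              # NaN ignored
pandas_default = ser.mean()              # NaN ignored (skipna=True is the default)
pandas_strict = ser.mean(skipna=False)   # NaN propagates

print(numpy_mean, nan_aware, pandas_default, pandas_strict)
```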

## Filling in missing data


In [None]:
filledDf = df_store.fillna(0)
filledDf

Unnamed: 0,store1,store2,store3,store4,store5
apple,15.0,16.0,17.0,20.0,0.0
banana,18.0,19.0,20.0,0.0,0.0
kiwi,21.0,22.0,23.0,0.0,0.0
grapes,24.0,25.0,26.0,0.0,0.0
mango,27.0,28.0,29.0,0.0,0.0
watermelon,15.0,16.0,17.0,18.0,0.0
oranges,0.0,0.0,0.0,0.0,0.0


In [None]:
df_store.mean()

store1    20.0
store2    21.0
store3    22.0
store4    19.0
store5     NaN
dtype: float64

In [None]:
"""
Exercise: Calculate mean of all stores in filledDf
"""

"""
Solution
filledDf.mean()
"""

store1    17.142857
store2    18.000000
store3    18.857143
store4     5.428571
store5     0.000000
dtype: float64

## Forward and backward filling of the missing values

In [None]:
# fillna(method='ffill') is deprecated in recent pandas; ffill() is the equivalent call
df_store.store4.ffill()

apple         20.0
banana        20.0
kiwi          20.0
grapes        20.0
mango         20.0
watermelon    18.0
oranges       18.0
Name: store4, dtype: float64

In [None]:
# fillna(method='bfill') is deprecated in recent pandas; bfill() is the equivalent call
df_store.store4.bfill()

apple         20.0
banana        18.0
kiwi          18.0
grapes        18.0
mango         18.0
watermelon    18.0
oranges        NaN
Name: store4, dtype: float64
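
Forward filling can carry a value across a long gap, which is not always desirable. One option is the limit argument, sketched here on a made-up Series:

```python
import numpy as np
import pandas as pd

ser = pd.Series([20.0, np.nan, np.nan, np.nan, 18.0, np.nan])

# Fill at most one consecutive missing value after each observation
capped = ser.ffill(limit=1)

print(capped)
```

Positions 2 and 3 stay NaN because they are more than one step away from the last observed value.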

## Filling with index labels


In [None]:
to_fill = pd.Series([14, 23, 12], index=['apple', 'mango', 'oranges'])
to_fill

apple      14
mango      23
oranges    12
dtype: int64

In [1]:
df_store.store4.fillna(to_fill)



In [8]:
df_store.mean()

store1    20.0
store2    21.0
store3    22.0
store4    19.0
store5     NaN
dtype: float64

In [7]:
"""
Exercise: Use mean of each store in df_store to populate respective null values in the columns
"""
df_store.fillna(df_store.mean())

Unnamed: 0,store1,store2,store3,store4,store5
apple,15.0,16.0,17.0,20.0,
banana,18.0,19.0,20.0,19.0,
kiwi,21.0,22.0,23.0,19.0,
grapes,24.0,25.0,26.0,19.0,
mango,27.0,28.0,29.0,19.0,
watermelon,15.0,16.0,17.0,18.0,
oranges,20.0,21.0,22.0,19.0,


## Interpolation of missing values

In [None]:
ser3 = pd.Series([100, np.nan, np.nan, np.nan, 292])
ser3.interpolate()

0    100.0
1    148.0
2    196.0
3    244.0
4    292.0
dtype: float64

In [14]:
from datetime import datetime
ts = pd.Series([10, np.nan, np.nan, 9], 
               index=[datetime(2019, 1,1), 
                      datetime(2019, 2,1), 
                      datetime(2019, 3,1),
                      datetime(2019, 5,1)])

ts

2019-01-01    10.0
2019-02-01     NaN
2019-03-01     NaN
2019-05-01     9.0
dtype: float64

In [15]:
"""
Exercise: Use interpolate() to fill ts
"""

"""
Solution
ts.interpolate()
"""

2019-01-01    10.000000
2019-02-01     9.666667
2019-03-01     9.333333
2019-05-01     9.000000
dtype: float64

In [16]:
ts.interpolate(method='time')

2019-01-01    10.000000
2019-02-01     9.741667
2019-03-01     9.508333
2019-05-01     9.000000
dtype: float64
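
The two results differ because the default method treats the points as equally spaced, while method='time' weights by elapsed time. The sketch below reproduces the time-weighted values by hand; the variable names elapsed and by_hand are assumptions for illustration.

```python
from datetime import datetime

import numpy as np
import pandas as pd

index = pd.DatetimeIndex([datetime(2019, 1, 1), datetime(2019, 2, 1),
                          datetime(2019, 3, 1), datetime(2019, 5, 1)])
ts = pd.Series([10, np.nan, np.nan, 9], index=index)

# Days elapsed since the first observation: 0, 31, 59, 120
elapsed = (index - index[0]).days.to_numpy()

# Straight line from 10 to 9, evaluated at each elapsed time
by_hand = 10 + (9 - 10) * elapsed / elapsed[-1]
by_pandas = ts.interpolate(method='time')

print(by_hand)
print(by_pandas)
```

For 1 February this gives 10 - 31/120, which is about 9.741667 and matches the output above.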

# Renaming axis indexes

In [None]:
import numpy as np
import pandas as pd
data = np.arange(15).reshape((3,5))
indexers = ['Rainfall', 'Humidity', 'Wind']
dframe1 = pd.DataFrame(data, index=indexers, columns=['Lucknow', 'Delhi', 'Bengaluru', 'Chennai', 'Kolkata'])
dframe1

Unnamed: 0,Lucknow,Delhi,Bengaluru,Chennai,Kolkata
Rainfall,0,1,2,3,4
Humidity,5,6,7,8,9
Wind,10,11,12,13,14


In [None]:
# Say you want to convert the index labels to upper case.

dframe1.index = dframe1.index.map(str.upper)
dframe1

Unnamed: 0,Lucknow,Delhi,Bengaluru,Chennai,Kolkata
RAINFALL,0,1,2,3,4
HUMIDITY,5,6,7,8,9
WIND,10,11,12,13,14


In [None]:
"""
Exercise: Rename dframe1 index to Title Case & Column Names to Capital/Upper Case
"""

"""
Solution
dframe1.rename(index=str.title, columns=str.upper)
"""

Unnamed: 0,LUCKNOW,DELHI,BENGALURU,CHENNAI,KOLKATA
Rainfall,0,1,2,3,4
Humidity,5,6,7,8,9
Wind,10,11,12,13,14
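
rename() also accepts dicts, which is handy when only a few labels need to change. A minimal sketch, where the new labels are made up for illustration:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape(2, 3),
                     index=['Rainfall', 'Wind'],
                     columns=['Lucknow', 'Delhi', 'Chennai'])

# Labels missing from the dicts are left unchanged
renamed = frame.rename(index={'Wind': 'Wind speed'},
                       columns={'Delhi': 'New Delhi'})

print(renamed)
```

rename() returns a new dataframe; the original frame keeps its labels.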


# Discretization and binning


In [None]:
import pandas as pd

height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]

bins = [118, 125, 135, 160, 200]

category = pd.cut(height, bins)

category

[(118, 125], (118, 125], (118, 125], (125, 135], (118, 125], ..., (125, 135], (160, 200], (135, 160], (135, 160], (125, 135]]
Length: 12
Categories (4, interval[int64, right]): [(118, 125] < (125, 135] < (135, 160] < (160, 200]]

In [None]:
"""
Exercise: Get frequency of items in every bin
Hint use value_counts()
"""

"""
Solution
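# The top-level pd.value_counts() is deprecated in recent pandas; wrap the result in a Series instead
pd.Series(category).value_counts()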
"""

(118, 125]    5
(125, 135]    3
(135, 160]    3
(160, 200]    1
dtype: int64

In [None]:
"""
Exercise: Change the default behaviour of pd.cut() so that each bin includes
its left edge and excludes its right edge.
This is the inverse of the default behaviour, which includes the right edge and excludes the left edge.
"""

"""
Solution
category2 = pd.cut(height, [118, 126, 136, 161, 200], right=False)

category2
"""


[[118, 126), [118, 126), [118, 126), [126, 136), [118, 126), ..., [126, 136), [161, 200), [136, 161), [136, 161), [126, 136)]
Length: 12
Categories (4, interval[int64, left]): [[118, 126) < [126, 136) < [136, 161) < [161, 200)]

In [None]:
bin_names = ['Short Height', 'Average Height', 'Good Height', 'Taller']
pd.cut(height, bins, labels=bin_names)

['Short Height', 'Short Height', 'Short Height', 'Average Height', 'Short Height', ..., 'Average Height', 'Taller', 'Good Height', 'Good Height', 'Average Height']
Length: 12
Categories (4, object): ['Short Height' < 'Average Height' < 'Good Height' < 'Taller']

In [17]:
# Number of bins as integer
import numpy as np

pd.cut(np.random.rand(40), 5, precision=2)


[(0.79, 0.99], (0.4, 0.6], (0.4, 0.6], (0.79, 0.99], (0.79, 0.99], ..., (0.0098, 0.21], (0.4, 0.6], (0.21, 0.4], (0.6, 0.79], (0.6, 0.79]]
Length: 40
Categories (5, interval[float64, right]): [(0.0098, 0.21] < (0.21, 0.4] < (0.4, 0.6] < (0.6, 0.79] <
                                           (0.79, 0.99]]

In [29]:
"""
Qcut (quantile-cut) differs from cut in the sense that, in qcut,
the number of elements in each bin will be roughly the same,
but this will come at the cost of differently sized interval widths.
""" 
randomNumbers = np.random.rand(50)
category3 = pd.qcut(randomNumbers, 4) # cut into quartiles
category3

[(0.29, 0.538], (0.538, 0.783], (0.00537, 0.29], (0.538, 0.783], (0.538, 0.783], ..., (0.29, 0.538], (0.00537, 0.29], (0.783, 0.989], (0.00537, 0.29], (0.29, 0.538]]
Length: 50
Categories (4, interval[float64, right]): [(0.00537, 0.29] < (0.29, 0.538] < (0.538, 0.783] <
                                           (0.783, 0.989]]

In [30]:
# The top-level pd.value_counts() is deprecated in recent pandas
pd.Series(category3).value_counts()

(0.00537, 0.29]    13
(0.783, 0.989]     13
(0.29, 0.538]      12
(0.538, 0.783]     12
dtype: int64

In [35]:
pd.qcut(randomNumbers, [0, 0.3, 0.5, 0.7, 1.0])

[(0.315, 0.538], (0.538, 0.774], (0.00537, 0.315], (0.538, 0.774], (0.538, 0.774], ..., (0.315, 0.538], (0.00537, 0.315], (0.774, 0.989], (0.00537, 0.315], (0.315, 0.538]]
Length: 50
Categories (4, interval[float64, right]): [(0.00537, 0.315] < (0.315, 0.538] < (0.538, 0.774] <
                                           (0.774, 0.989]]

In [36]:
pd.Series(pd.qcut(randomNumbers, [0, 0.3, 0.5, 0.7, 1.0])).value_counts()

(0.00537, 0.315]    15
(0.774, 0.989]      15
(0.315, 0.538]      10
(0.538, 0.774]      10
dtype: int64
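
Because the examples above use random numbers, the contrast between cut and qcut changes from run to run. A deterministic sketch on a small skewed Series (values chosen for illustration) makes it easier to see:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

equal_width = pd.cut(values, 4)    # four bins of equal width between 1 and 100
equal_count = pd.qcut(values, 4)   # four bins with (roughly) equal membership

width_counts = equal_width.value_counts(sort=False).tolist()
count_counts = equal_count.value_counts(sort=False).tolist()

print(width_counts)
print(count_counts)
```

With cut, the single large value stretches the range so that seven of the eight values share one bin; with qcut, each bin holds two values but the bin widths differ greatly.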

In [None]:
df = pd.read_csv("https://drive.google.com/uc?id=1fh7UdeQO0p_yWUAKcCinv9dElxaWQ5Ya")
df.head(10)

Unnamed: 0,Account,Company,Order,SKU,Country,Year,Quantity,UnitPrice,transactionComplete
0,123456779,Kulas Inc,99985,s9-supercomputer,Aruba,1981,5148,545,False
1,123456784,GitHub,99986,s4-supercomputer,Brazil,2001,3262,383,False
2,123456782,Kulas Inc,99990,s10-supercomputer,Montserrat,1973,9119,407,True
3,123456783,My SQ Man,99999,s1-supercomputer,El Salvador,2015,3097,615,False
4,123456787,ABC Dogma,99996,s6-supercomputer,Poland,1970,3356,91,True
5,123456778,Super Sexy Dingo,99996,s9-supercomputer,Costa Rica,2004,2474,136,True
6,123456783,ABC Dogma,99981,s11-supercomputer,Spain,2006,4081,195,False
7,123456785,ABC Dogma,99998,s9-supercomputer,Belarus,2015,6576,603,False
8,123456778,Loolo INC,99997,s8-supercomputer,Mauritius,1999,2460,36,False
9,123456775,Kulas Inc,99997,s7-supercomputer,French Guiana,2004,1831,664,True


In [None]:
print(df.shape)
df.describe()

(10000, 9)


Unnamed: 0,Account,Order,Year,Quantity,UnitPrice
count,10000.0,10000.0,10000.0,10000.0,10000.0
mean,123456800.0,99989.5629,1994.6198,4985.4473,355.8666
std,5.741156,5.905551,14.432771,2868.949686,201.378478
min,123456800.0,99980.0,1970.0,0.0,10.0
25%,123456800.0,99985.0,1982.0,2505.75,181.0
50%,123456800.0,99990.0,1995.0,4994.0,356.0
75%,123456800.0,99995.0,2007.0,7451.5,531.0
max,123456800.0,99999.0,2019.0,9999.0,700.0


In [None]:
# Creating total price column using UnitPrice and Quantity
df['TotalPrice'] = df['UnitPrice'] * df['Quantity']
df.head(10)

Unnamed: 0,Account,Company,Order,SKU,Country,Year,Quantity,UnitPrice,transactionComplete,TotalPrice
0,123456779,Kulas Inc,99985,s9-supercomputer,Aruba,1981,5148,545,False,2805660
1,123456784,GitHub,99986,s4-supercomputer,Brazil,2001,3262,383,False,1249346
2,123456782,Kulas Inc,99990,s10-supercomputer,Montserrat,1973,9119,407,True,3711433
3,123456783,My SQ Man,99999,s1-supercomputer,El Salvador,2015,3097,615,False,1904655
4,123456787,ABC Dogma,99996,s6-supercomputer,Poland,1970,3356,91,True,305396
5,123456778,Super Sexy Dingo,99996,s9-supercomputer,Costa Rica,2004,2474,136,True,336464
6,123456783,ABC Dogma,99981,s11-supercomputer,Spain,2006,4081,195,False,795795
7,123456785,ABC Dogma,99998,s9-supercomputer,Belarus,2015,6576,603,False,3965328
8,123456778,Loolo INC,99997,s8-supercomputer,Mauritius,1999,2460,36,False,88560
9,123456775,Kulas Inc,99997,s7-supercomputer,French Guiana,2004,1831,664,True,1215784


In [None]:
"""
Exercise: Find the transactions whose total price exceeds 3000000
"""

"""
Solution
TotalTransaction = df["TotalPrice"]
TotalTransaction[np.abs(TotalTransaction) > 3000000]
"""

2       3711433
7       3965328
13      4758900
15      5189372
17      3989325
         ...   
9977    3475824
9984    5251134
9987    5670420
9991    5735513
9996    3018490
Name: TotalPrice, Length: 2094, dtype: int64

In [None]:
df[np.abs(TotalTransaction) > 6741112]

Unnamed: 0,Account,Company,Order,SKU,Country,Year,Quantity,UnitPrice,transactionComplete,TotalPrice
818,123456781,Gen Power,99991,s1-supercomputer,Burkina Faso,1985,9693,696,False,6746328
1402,123456778,Will LLC,99985,s11-supercomputer,Austria,1990,9844,695,True,6841580
2242,123456770,Name IT,99997,s9-supercomputer,Myanmar,1979,9804,692,False,6784368
2876,123456772,Gen Power,99992,s10-supercomputer,Mali,2007,9935,679,False,6745865
3210,123456782,Loolo INC,99991,s8-supercomputer,Kuwait,2006,9886,692,False,6841112
3629,123456779,My SQ Man,99980,s3-supercomputer,Hong Kong,1994,9694,700,False,6785800
7674,123456781,Loolo INC,99989,s6-supercomputer,Sri Lanka,1994,9882,691,False,6828462
8645,123456789,Gen Power,99996,s11-supercomputer,Suriname,2005,9742,699,False,6809658
8684,123456785,Gen Power,99989,s2-supercomputer,Kenya,2013,9805,694,False,6804670


# Permutation and random sampling

In [None]:
dat = np.arange(80).reshape(10,8)
df = pd.DataFrame(dat)

df

Unnamed: 0,0,1,2,3,4,5,6,7
0,0,1,2,3,4,5,6,7
1,8,9,10,11,12,13,14,15
2,16,17,18,19,20,21,22,23
3,24,25,26,27,28,29,30,31
4,32,33,34,35,36,37,38,39
5,40,41,42,43,44,45,46,47
6,48,49,50,51,52,53,54,55
7,56,57,58,59,60,61,62,63
8,64,65,66,67,68,69,70,71
9,72,73,74,75,76,77,78,79


In [None]:
sampler = np.random.permutation(10)
sampler

array([2, 3, 1, 7, 5, 8, 9, 0, 6, 4])

In [None]:
df.take(sampler)

Unnamed: 0,0,1,2,3,4,5,6,7
2,16,17,18,19,20,21,22,23
3,24,25,26,27,28,29,30,31
1,8,9,10,11,12,13,14,15
7,56,57,58,59,60,61,62,63
5,40,41,42,43,44,45,46,47
8,64,65,66,67,68,69,70,71
9,72,73,74,75,76,77,78,79
0,0,1,2,3,4,5,6,7
6,48,49,50,51,52,53,54,55
4,32,33,34,35,36,37,38,39


In [None]:
# Random sample without replacement

df.take(np.random.permutation(len(df))[:3])

Unnamed: 0,0,1,2,3,4,5,6,7
6,48,49,50,51,52,53,54,55
2,16,17,18,19,20,21,22,23
3,24,25,26,27,28,29,30,31


In [None]:
# Random sample with replacement
sack = np.array([4, 8, -2, 7, 5])
sampler = np.random.randint(0, len(sack), size = 10)
sampler

array([2, 3, 4, 4, 1, 2, 1, 1, 2, 4])

In [None]:
"""
Exercise: Draw numbers from sack using sampler
"""

"""
Solution
draw = sack.take(sampler)
draw 
"""


array([-2,  7,  5,  5,  8, -2,  8,  8, -2,  5])
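
pandas also offers DataFrame.sample(), which covers the same cases without building the index array by hand. A minimal sketch; random_state=0 is an arbitrary seed chosen here for reproducibility.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(80).reshape(10, 8))

# Three distinct rows (sampling without replacement is the default)
without_replacement = frame.sample(n=3, random_state=0)
# Fifteen draws from ten rows, so some rows must repeat
with_replacement = frame.sample(n=15, replace=True, random_state=0)
# frac=1 returns every row in random order, i.e. a permutation
shuffled = frame.sample(frac=1, random_state=0)

print(without_replacement.index.tolist())
print(shuffled.index.tolist())
```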

# Dummy variables

In [None]:
df = pd.DataFrame({'gender': ['female', 'female', 'male', 'unknown', 'male', 'female'], 'votes': range(6, 12, 1)})
df

Unnamed: 0,gender,votes
0,female,6
1,female,7
2,male,8
3,unknown,9
4,male,10
5,female,11


In [None]:
pd.get_dummies(df['gender'])

Unnamed: 0,female,male,unknown
0,1,0,0
1,1,0,0
2,0,1,0
3,0,0,1
4,0,1,0
5,1,0,0


In [None]:
dummies = pd.get_dummies(df['gender'], prefix='gender')
dummies

Unnamed: 0,gender_female,gender_male,gender_unknown
0,1,0,0
1,1,0,0
2,0,1,0
3,0,0,1
4,0,1,0
5,1,0,0


In [None]:
"""
Exercise: Join the votes column with the dummies dataframe
"""

"""
Solution
with_dummy = df[['votes']].join(dummies)
with_dummy
"""


Unnamed: 0,votes,gender_female,gender_male,gender_unknown
0,6,1,0,0
1,7,1,0,0
2,8,0,1,0
3,9,0,0,1
4,10,0,1,0
5,11,1,0,0
