![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Data Transformation Cheat Sheet

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Table of Content

1. Creating a Pandas Dataframe
    - Setting an Index
    - Sorting Dataframe
2. Modifying Datetime
    - Converting Year and Month
3. Summary Statistics
4. Column Wrangling & Feature Engineering
    - Modifying All
    - Modify the first appearance of Vision to year 1964
5. Pivoting Data
    - Wide
    - Long
6. Aggregating Data
    - Calculating Percentage Share
    - Calculating Unique Values
7. Selecting, Indexing, Filtering & Slicing
8. Filtering
    - Filtering Multiple Values
    - Retrieving Single and Multiple Columns
    - Filtering then summarizing
9. Modifying Dataframes
    - Renaming Columns
10. Appending Values
11. Cleaning Null Values
    - Filling Null Values on Dataframes
    - Backward and Forward filling
12. Cleaning All Values
13. Dealing w/ Duplicates
    - Duplicates in DataFrames
14. Changing Datatypes
15. Joining Data



![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Libraries

In [161]:
import pandas as pd
import numpy as np

In [162]:
pd.read_csv?

In [163]:
# Please see the Data Ingestion Cheat sheet to learn how to load data
sales = pd.read_csv(
    'data/sales_data.csv',
    parse_dates=['Date']
)

In [164]:
census_county = pd.read_csv("data/census_county.csv")

new = census_county["NAME"].str.split(", ", expand = True)
census_county["county_name"] = new[0]
census_county["state_name"] = new[1]

census_county.drop(columns=['NAME'], inplace=True)
census_county.head()

Unnamed: 0,state,county,median_income,population,year,county_name,state_name
0,37,43,36711.0,10506,2011,Clay County,North Carolina
1,37,51,44861.0,316478,2011,Cumberland County,North Carolina
2,37,81,46288.0,483081,2011,Guilford County,North Carolina
3,37,99,36826.0,39574,2011,Jackson County,North Carolina
4,37,139,45298.0,40511,2011,Pasquotank County,North Carolina


The read_csv function is extremely powerful and you can specify many more parameters at import time. We can achive the same results with only one line by doing:

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
# Creating a Pandas Dataframe


In [165]:
country_stat = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [
        1785387,
        2833687,
        3874437,
        2167744,
        4602367,
        2950039,
        17348075
    ],
    'Surface Area': [
        9984670,
        640679,
        357114,
        301336,
        377930,
        242495,
        9525067
    ],
    'HDI': [
        0.913,
        0.888,
        0.916,
        0.873,
        0.891,
        0.907,
        0.915
    ],
    'Continent': [
        'America',
        'Europe',
        'Europe',
        'Europe',
        'Asia',
        'Europe',
        'America'
    ]
}, columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

# Setting the Index

In [166]:
country_stat.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]

In [167]:
country_stat.sort_values(by=['Continent','Population','HDI'], ascending=False)

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Germany,80.94,3874437,357114,0.916,Europe
United Kingdom,64.511,2950039,242495,0.907,Europe
France,63.951,2833687,640679,0.888,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United States,318.523,17348075,9525067,0.915,America
Canada,35.467,1785387,9984670,0.913,America


In [168]:
btc_price = pd.read_csv(
    'data/btc-market-price.csv',
    header=None,
    names=['Timestamp', 'Price'],
    index_col=0,
    parse_dates=True
)

In [169]:
btc_price.head()

Unnamed: 0_level_0,Price
Timestamp,Unnamed: 1_level_1
2017-04-02,1099.169125
2017-04-03,1141.813
2017-04-04,1141.600363
2017-04-05,1133.079314
2017-04-06,1196.307937


In [170]:
marvel_data = [
    ['Spider-Man', 'male', 1962],
    ['Captain America', 'male', 1941],
    ['Wolverine', 'male', 1974],
    ['Iron Man', 'male', 1963],
    ['Thor', 'male', 1963],
    ['Thing', 'male', 1961],
    ['Mister Fantastic', 'male', 1961],
    ['Hulk', 'male', 1962],
    ['Beast', 'male', 1963],
    ['Invisible Woman', 'female', 1961],
    ['Storm', 'female', 1975],
    ['Namor', 'male', 1939],
    ['Hawkeye', 'male', 1964],
    ['Daredevil', 'male', 1964],
    ['Doctor Strange', 'male', 1963],
    ['Hank Pym', 'male', 1962],
    ['Scarlet Witch', 'female', 1964],
    ['Wasp', 'female', 1963],
    ['Black Widow', 'female', 1964],
    ['Vision', 'male', 1968]
]

marvel_df = pd.DataFrame(data=marvel_data,
                         columns=['name', 'sex', 'first_appearance'])
marvel_df

Unnamed: 0,name,sex,first_appearance
0,Spider-Man,male,1962
1,Captain America,male,1941
2,Wolverine,male,1974
3,Iron Man,male,1963
4,Thor,male,1963
5,Thing,male,1961
6,Mister Fantastic,male,1961
7,Hulk,male,1962
8,Beast,male,1963
9,Invisible Woman,female,1961


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Modifying Datetime

In [171]:
btc_price.dtypes

Price    float64
dtype: object

In [172]:
btc_price = btc_price.reset_index()
btc_price['Timestamp'] = pd.to_datetime(btc_price['Timestamp'])
btc_price.dtypes

Timestamp    datetime64[ns]
Price               float64
dtype: object

In [173]:
btc_price.set_index('Timestamp', inplace=True)

In [174]:
btc_price.loc['2017-09-29':'2017-10-05']

Unnamed: 0_level_0,Price
Timestamp,Unnamed: 1_level_1
2017-09-29,4193.574667
2017-09-30,4335.368317
2017-10-01,4360.722967
2017-10-02,4386.88375
2017-10-03,4293.3066
2017-10-04,4225.175
2017-10-05,4338.852


In [175]:
import datetime

year = datetime.date.today().year

marvel_df['years_since'] = year - marvel_df['first_appearance']
marvel_df

Unnamed: 0,name,sex,first_appearance,years_since
0,Spider-Man,male,1962,60
1,Captain America,male,1941,81
2,Wolverine,male,1974,48
3,Iron Man,male,1963,59
4,Thor,male,1963,59
5,Thing,male,1961,61
6,Mister Fantastic,male,1961,61
7,Hulk,male,1962,60
8,Beast,male,1963,59
9,Invisible Woman,female,1961,61


## Dealing with Dates

In [176]:
sales['Year'] = sales['Date'].dt.year
sales['Month'] = sales['Date'].dt.month
sales

Unnamed: 0,Date,Day,Month,Year,Customer_Age,Age_Group,Customer_Gender,Country,State,Product_Category,Sub_Category,Product,Order_Quantity,Unit_Cost,Unit_Price,Profit,Cost,Revenue
0,2013-11-26,26,11,2013,19,Youth (<25),M,Canada,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950
1,2015-11-26,26,11,2015,19,Youth (<25),M,Canada,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950
2,2014-03-23,23,3,2014,49,Adults (35-64),M,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,23,45,120,1366,1035,2401
3,2016-03-23,23,3,2016,49,Adults (35-64),M,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,20,45,120,1188,900,2088
4,2014-05-15,15,5,2014,47,Adults (35-64),F,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,4,45,120,238,180,418
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
113031,2016-04-12,12,4,2016,41,Adults (35-64),M,United Kingdom,England,Clothing,Vests,"Classic Vest, S",3,24,64,112,72,184
113032,2014-04-02,2,4,2014,18,Youth (<25),M,Australia,Queensland,Clothing,Vests,"Classic Vest, M",22,24,64,655,528,1183
113033,2016-04-02,2,4,2016,18,Youth (<25),M,Australia,Queensland,Clothing,Vests,"Classic Vest, M",22,24,64,655,528,1183
113034,2014-03-04,4,3,2014,37,Adults (35-64),F,France,Seine (Paris),Clothing,Vests,"Classic Vest, L",24,24,64,684,576,1260


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
# Summary Statistics

In [177]:
sales.head()

Unnamed: 0,Date,Day,Month,Year,Customer_Age,Age_Group,Customer_Gender,Country,State,Product_Category,Sub_Category,Product,Order_Quantity,Unit_Cost,Unit_Price,Profit,Cost,Revenue
0,2013-11-26,26,11,2013,19,Youth (<25),M,Canada,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950
1,2015-11-26,26,11,2015,19,Youth (<25),M,Canada,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950
2,2014-03-23,23,3,2014,49,Adults (35-64),M,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,23,45,120,1366,1035,2401
3,2016-03-23,23,3,2016,49,Adults (35-64),M,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,20,45,120,1188,900,2088
4,2014-05-15,15,5,2014,47,Adults (35-64),F,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,4,45,120,238,180,418


In [178]:
sales.shape

(113036, 18)

In [179]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113036 entries, 0 to 113035
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Date              113036 non-null  datetime64[ns]
 1   Day               113036 non-null  int64         
 2   Month             113036 non-null  int64         
 3   Year              113036 non-null  int64         
 4   Customer_Age      113036 non-null  int64         
 5   Age_Group         113036 non-null  object        
 6   Customer_Gender   113036 non-null  object        
 7   Country           113036 non-null  object        
 8   State             113036 non-null  object        
 9   Product_Category  113036 non-null  object        
 10  Sub_Category      113036 non-null  object        
 11  Product           113036 non-null  object        
 12  Order_Quantity    113036 non-null  int64         
 13  Unit_Cost         113036 non-null  int64         
 14  Unit

In [180]:
sales.columns

Index(['Date', 'Day', 'Month', 'Year', 'Customer_Age', 'Age_Group',
       'Customer_Gender', 'Country', 'State', 'Product_Category',
       'Sub_Category', 'Product', 'Order_Quantity', 'Unit_Cost', 'Unit_Price',
       'Profit', 'Cost', 'Revenue'],
      dtype='object')

In [181]:
sales.describe()

Unnamed: 0,Day,Month,Year,Customer_Age,Order_Quantity,Unit_Cost,Unit_Price,Profit,Cost,Revenue
count,113036.0,113036.0,113036.0,113036.0,113036.0,113036.0,113036.0,113036.0,113036.0,113036.0
mean,15.665753,6.453024,2014.401739,35.919212,11.90166,267.296366,452.938427,285.051665,469.318695,754.37036
std,8.781567,3.478198,1.27251,11.021936,9.561857,549.835483,922.071219,453.887443,884.866118,1309.094674
min,1.0,1.0,2011.0,17.0,1.0,1.0,2.0,-30.0,1.0,2.0
25%,8.0,4.0,2013.0,28.0,2.0,2.0,5.0,29.0,28.0,63.0
50%,16.0,6.0,2014.0,35.0,10.0,9.0,24.0,101.0,108.0,223.0
75%,23.0,10.0,2016.0,43.0,20.0,42.0,70.0,358.0,432.0,800.0
max,31.0,12.0,2016.0,87.0,32.0,2171.0,3578.0,15096.0,42978.0,58074.0


In [182]:
sales.size

2034648

In [183]:
sales.dtypes

Date                datetime64[ns]
Day                          int64
Month                        int64
Year                         int64
Customer_Age                 int64
Age_Group                   object
Customer_Gender             object
Country                     object
State                       object
Product_Category            object
Sub_Category                object
Product                     object
Order_Quantity               int64
Unit_Cost                    int64
Unit_Price                   int64
Profit                       int64
Cost                         int64
Revenue                      int64
dtype: object

In [184]:
sales.dtypes.value_counts()

int64             10
object             7
datetime64[ns]     1
dtype: int64

In [185]:
sales['Unit_Cost'].describe()

count    113036.000000
mean        267.296366
std         549.835483
min           1.000000
25%           2.000000
50%           9.000000
75%          42.000000
max        2171.000000
Name: Unit_Cost, dtype: float64

In [186]:
sales['Unit_Cost'].mean()

267.296365759581

In [187]:
sales['Unit_Cost'].median()

9.0

In [188]:
sales['Unit_Cost'].min(), sales['Unit_Cost'].max()

(1, 2171)

In [189]:
sales['Unit_Cost'].std()

549.8354831077943

In [190]:
sales['Unit_Cost'].quantile(0.25)

2.0

In [191]:
sales['Unit_Cost'].quantile([.2, .4, .6, .8, 1])

0.2       2.0
0.4       7.0
0.6      13.0
0.8     344.0
1.0    2171.0
Name: Unit_Cost, dtype: float64

In [192]:
sales['Age_Group'].value_counts()

Adults (35-64)          55824
Young Adults (25-34)    38654
Youth (<25)             17828
Seniors (64+)             730
Name: Age_Group, dtype: int64

In [193]:
sales['Age_Group'].value_counts(normalize=True)

Adults (35-64)          0.493860
Young Adults (25-34)    0.341962
Youth (<25)             0.157720
Seniors (64+)           0.006458
Name: Age_Group, dtype: float64

In [194]:
country_group = sales.groupby(['Country'])
country_group['Age_Group'].value_counts(normalize=True).loc['Australia']

Age_Group
Adults (35-64)          0.434241
Young Adults (25-34)    0.380264
Youth (<25)             0.183072
Seniors (64+)           0.002423
Name: Age_Group, dtype: float64

In [195]:
sales

Unnamed: 0,Date,Day,Month,Year,Customer_Age,Age_Group,Customer_Gender,Country,State,Product_Category,Sub_Category,Product,Order_Quantity,Unit_Cost,Unit_Price,Profit,Cost,Revenue
0,2013-11-26,26,11,2013,19,Youth (<25),M,Canada,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950
1,2015-11-26,26,11,2015,19,Youth (<25),M,Canada,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950
2,2014-03-23,23,3,2014,49,Adults (35-64),M,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,23,45,120,1366,1035,2401
3,2016-03-23,23,3,2016,49,Adults (35-64),M,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,20,45,120,1188,900,2088
4,2014-05-15,15,5,2014,47,Adults (35-64),F,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,4,45,120,238,180,418
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
113031,2016-04-12,12,4,2016,41,Adults (35-64),M,United Kingdom,England,Clothing,Vests,"Classic Vest, S",3,24,64,112,72,184
113032,2014-04-02,2,4,2014,18,Youth (<25),M,Australia,Queensland,Clothing,Vests,"Classic Vest, M",22,24,64,655,528,1183
113033,2016-04-02,2,4,2016,18,Youth (<25),M,Australia,Queensland,Clothing,Vests,"Classic Vest, M",22,24,64,655,528,1183
113034,2014-03-04,4,3,2014,37,Adults (35-64),F,France,Seine (Paris),Clothing,Vests,"Classic Vest, L",24,24,64,684,576,1260


In [196]:
corr = sales.corr()
corr

Unnamed: 0,Day,Month,Year,Customer_Age,Order_Quantity,Unit_Cost,Unit_Price,Profit,Cost,Revenue
Day,1.0,0.014963,-0.007635,-0.014296,-0.002412,0.003133,0.003207,0.004623,0.003329,0.003853
Month,0.014963,1.0,-0.315359,-0.051234,0.028175,-0.021202,-0.021218,-0.002004,-0.0067,-0.005224
Year,-0.007635,-0.315359,1.0,0.040994,0.123169,-0.217575,-0.213673,-0.181525,-0.215604,-0.208673
Customer_Age,-0.014296,-0.051234,0.040994,1.0,0.026887,-0.021374,-0.020262,0.004319,-0.016013,-0.009326
Order_Quantity,-0.002412,0.028175,0.123169,0.026887,1.0,-0.515835,-0.515925,-0.238863,-0.340382,-0.312895
Unit_Cost,0.003133,-0.021202,-0.217575,-0.021374,-0.515835,1.0,0.997894,0.74102,0.829869,0.817865
Unit_Price,0.003207,-0.021218,-0.213673,-0.020262,-0.515925,0.997894,1.0,0.74987,0.826301,0.818522
Profit,0.004623,-0.002004,-0.181525,0.004319,-0.238863,0.74102,0.74987,1.0,0.902233,0.956572
Cost,0.003329,-0.0067,-0.215604,-0.016013,-0.340382,0.829869,0.826301,0.902233,1.0,0.988758
Revenue,0.003853,-0.005224,-0.208673,-0.009326,-0.312895,0.817865,0.818522,0.956572,0.988758,1.0


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
# Column Wrangling & Feature Engineering

In [197]:
sales['Revenue_per_Age'] = sales['Revenue'] / sales['Customer_Age']
sales['Revenue_per_Age'].head()

0    50.000000
1    50.000000
2    49.000000
3    42.612245
4     8.893617
Name: Revenue_per_Age, dtype: float64

In [198]:
sales['Calculated_Cost'] = sales['Order_Quantity'] * sales['Unit_Cost']
sales['Calculated_Cost'].head()

0     360
1     360
2    1035
3     900
4     180
Name: Calculated_Cost, dtype: int64

In [199]:
(sales['Calculated_Cost'] != sales['Cost']).sum()

0

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Modifying All

In [200]:
sales['Calculated_Cost'] *= 1.03
sales['Calculated_Cost'].head()

0     370.80
1     370.80
2    1066.05
3     927.00
4     185.40
Name: Calculated_Cost, dtype: float64

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Modify the `first appearance` of `Vision` to year 1964

In [201]:
marvel_df.loc['Vision', 'first_appearance'] = 1964
marvel_df

Unnamed: 0,name,sex,first_appearance,years_since
0,Spider-Man,male,1962.0,60.0
1,Captain America,male,1941.0,81.0
2,Wolverine,male,1974.0,48.0
3,Iron Man,male,1963.0,59.0
4,Thor,male,1963.0,59.0
5,Thing,male,1961.0,61.0
6,Mister Fantastic,male,1961.0,61.0
7,Hulk,male,1962.0,60.0
8,Beast,male,1963.0,59.0
9,Invisible Woman,female,1961.0,61.0


In [202]:
sales.loc[sales['Country'] == 'France', 'Revenue'] *= 1.10
sales.loc[sales['Country'] == 'France', 'Revenue']

50         865.7
51         865.7
52        3252.7
53        3136.1
60         688.6
           ...  
112979    1892.0
113000     405.9
113001     473.0
113034    1386.0
113035    1327.7
Name: Revenue, Length: 10998, dtype: float64

In [203]:
crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])
crisis

GDP   -1000000.0
HDI         -0.3
dtype: float64

In [204]:
crisis[['GDP', 'HDI']]

GDP   -1000000.0
HDI         -0.3
dtype: float64

In [205]:
crisis[['GDP', 'HDI']] + crisis

GDP   -2000000.0
HDI         -0.6
dtype: float64

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
# Pivoting Data

## Long

In [206]:
d1 = {"Name": ["Pankaj", "Lisa", "David"], "ID": [1, 2, 3], "Role": ["CEO", "Editor", "Author"]}
df_wide = pd.DataFrame(d1)
df_wide

Unnamed: 0,Name,ID,Role
0,Pankaj,1,CEO
1,Lisa,2,Editor
2,David,3,Author


In [207]:
df_long = pd.melt(df_wide, id_vars=["ID"], value_vars=["Name", "Role"], var_name="Attribute", value_name="Value")
df_long

Unnamed: 0,ID,Attribute,Value
0,1,Name,Pankaj
1,2,Name,Lisa
2,3,Name,David
3,1,Role,CEO
4,2,Role,Editor
5,3,Role,Author


## Wide

In [208]:
# https://beta.bls.gov/dataQuery/find?fq=survey:[ap]&s=popularity:D
# This data came from bls
df_long = pd.read_csv("data/file.csv")
df_long

Unnamed: 0,Series ID,Item,Year Month,Avg. Price ($)
0,APU0000702111,"Bread, whilte per lb",2020 Jan,1.351
1,APU0000702111,"Bread, whilte per lb",2020 Feb,1.375
2,APU0000702111,"Bread, whilte per lb",2020 Mar,1.374
3,APU0000702111,"Bread, whilte per lb",2020 Apr,1.406
4,APU0000702111,"Bread, whilte per lb",2020 May,1.412
...,...,...,...,...
97,APU0000709112,"Milk, Whole per gal",2022 Jun,4.153
98,APU0000709112,"Milk, Whole per gal",2022 Jul,4.156
99,APU0000709112,"Milk, Whole per gal",2022 Aug,4.194
100,APU0000709112,"Milk, Whole per gal",2022 Sep,4.181


In [209]:
# unmelting using pivot()
df_wide=pd.pivot(df_long, index=['Series ID','Item'], columns = 'Year Month',values = 'Avg. Price ($)') #Reshape from long to wide

df_wide

Unnamed: 0_level_0,Year Month,2020 Apr,2020 Aug,2020 Dec,2020 Feb,2020 Jan,2020 Jul,2020 Jun,2020 Mar,2020 May,2020 Nov,...,2022 Apr,2022 Aug,2022 Feb,2022 Jan,2022 Jul,2022 Jun,2022 Mar,2022 May,2022 Oct,2022 Sep
Series ID,Item,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
APU0000702111,"Bread, whilte per lb",1.406,1.495,1.538,1.375,1.351,1.485,1.474,1.374,1.412,1.515,...,1.612,1.756,1.578,1.555,1.715,1.691,1.607,1.606,1.814,1.749
APU0000708111,"Eggs, large per doz",2.019,1.328,1.481,1.449,1.461,1.401,1.554,1.525,1.64,1.45,...,2.52,3.116,2.005,1.929,2.936,2.707,2.046,2.863,3.419,2.902
APU0000709112,"Milk, Whole per gal",3.267,3.406,3.535,3.196,3.253,3.255,3.198,3.248,3.21,3.425,...,4.012,4.194,3.875,3.787,4.156,4.153,3.917,4.204,4.184,4.181


## Long

In [210]:

year_list=list(df_wide.columns)
df_long = pd.melt(df_wide, value_vars=year_list,value_name='Avg. Price ($)', ignore_index=False).reset_index()
df_long

Unnamed: 0,Series ID,Item,Year Month,Avg. Price ($)
0,APU0000702111,"Bread, whilte per lb",2020 Apr,1.406
1,APU0000708111,"Eggs, large per doz",2020 Apr,2.019
2,APU0000709112,"Milk, Whole per gal",2020 Apr,3.267
3,APU0000702111,"Bread, whilte per lb",2020 Aug,1.495
4,APU0000708111,"Eggs, large per doz",2020 Aug,1.328
...,...,...,...,...
97,APU0000708111,"Eggs, large per doz",2022 Oct,3.419
98,APU0000709112,"Milk, Whole per gal",2022 Oct,4.184
99,APU0000702111,"Bread, whilte per lb",2022 Sep,1.749
100,APU0000708111,"Eggs, large per doz",2022 Sep,2.902


# Aggregating

## Calculating Percentage Share

In [211]:
sales_yr_age = sales.groupby(['Year','Age_Group']).agg({'Profit':'sum','Revenue':'sum'})
sales_yr_age

Unnamed: 0_level_0,Unnamed: 1_level_0,Profit,Revenue
Year,Age_Group,Unnamed: 2_level_1,Unnamed: 3_level_1
2011,Adults (35-64),1278031,3964024.4
2011,Seniors (64+),9169,25363.0
2011,Young Adults (25-34),1085579,3448203.7
2011,Youth (<25),508522,1621959.3
2012,Adults (35-64),1336260,4144715.5
2012,Seniors (64+),7506,23230.0
2012,Young Adults (25-34),1108066,3503094.7
2012,Youth (<25),500161,1601158.1
2013,Adults (35-64),2737478,6959641.5
2013,Seniors (64+),17764,39987.0


In [212]:
sales_yr_age['Revenue_Share'] = sales_yr_age['Revenue'] / sales_yr_age.groupby(['Year'])['Revenue'].sum()
sales_yr_age['Profit_Share'] = sales_yr_age['Profit'] / sales_yr_age.groupby(['Year'])['Profit'].sum()
sales_yr_age.reset_index()

Unnamed: 0,Year,Age_Group,Profit,Revenue,Revenue_Share,Profit_Share
0,2011,Adults (35-64),1278031,3964024.4,0.437552,0.44356
1,2011,Seniors (64+),9169,25363.0,0.0028,0.003182
2,2011,Young Adults (25-34),1085579,3448203.7,0.380615,0.376767
3,2011,Youth (<25),508522,1621959.3,0.179033,0.17649
4,2012,Adults (35-64),1336260,4144715.5,0.447005,0.452664
5,2012,Seniors (64+),7506,23230.0,0.002505,0.002543
6,2012,Young Adults (25-34),1108066,3503094.7,0.377806,0.375362
7,2012,Youth (<25),500161,1601158.1,0.172684,0.169432
8,2013,Adults (35-64),2737478,6959641.5,0.452418,0.459369
9,2013,Seniors (64+),17764,39987.0,0.002599,0.002981


## Counting Unique Values

In [213]:
sales.groupby('Age_Group').agg({'State':'nunique'})

Unnamed: 0_level_0,State
Age_Group,Unnamed: 1_level_1
Adults (35-64),47
Seniors (64+),18
Young Adults (25-34),42
Youth (<25),41


## Aggregating on Multiple Values

In [214]:
sales_country = sales.groupby('Country').agg({'Order_Quantity':np.sum,'Revenue':np.sum}).reset_index()
sales_country

Unnamed: 0,Country,Order_Quantity,Revenue
0,Australia,263585,21302059.0
1,Canada,192259,7935738.0
2,France,128995,9276159.2
3,Germany,125720,8978596.0
4,United Kingdom,157218,10646196.0
5,United States,477539,27975547.0


In [215]:
sales_country['aov'] = sales_country['Revenue']/ sales_country['Order_Quantity']
sales_country

Unnamed: 0,Country,Order_Quantity,Revenue,aov
0,Australia,263585,21302059.0,80.816659
1,Canada,192259,7935738.0,41.276289
2,France,128995,9276159.2,71.910998
3,Germany,125720,8978596.0,71.417404
4,United Kingdom,157218,10646196.0,67.716139
5,United States,477539,27975547.0,58.582748


In [216]:
sales_country = sales.copy()

# Totals
sales_country['Orders_Total'] = sales_country['Order_Quantity']
sales_country['Revenue_Total'] = sales_country['Revenue']

# Average
sales_country['Orders_Avg'] = sales_country['Order_Quantity']
sales_country['Revenue_Avg'] = sales_country['Revenue']

sales_country = sales_country.groupby('Country').agg({'Orders_Total':np.sum,
                                                      'Orders_Avg':np.mean,
                                                      'Revenue_Total':np.sum,
                                                      'Revenue_Avg':np.mean
                                                        }).reset_index()
sales_country

Unnamed: 0,Country,Orders_Total,Orders_Avg,Revenue_Total,Revenue_Avg
0,Australia,263585,11.012074,21302059.0,889.959016
1,Canada,192259,13.560375,7935738.0,559.721964
2,France,128995,11.728951,9276159.2,843.440553
3,Germany,125720,11.328167,8978596.0,809.028293
4,United Kingdom,157218,11.543172,10646196.0,781.659031
5,United States,477539,12.180253,27975547.0,713.552696


## Weight Average Calculation

In [217]:
def wavg(group, avg_name, weight_name):
    """ http://stackoverflow.com/questions/10951341/pandas-dataframe-aggregate-function-using-multiple-columns
    In rare instance, we may not have weights, so just return the mean. Customize this if your business case
    should return otherwise.
    """
    d = group[avg_name]
    w = group[weight_name]
    try:
        return (d * w).sum() / w.sum()
    except ZeroDivisionError:
        return d.mean()

In [218]:
census_wavg = census_county.groupby(['state_name']).apply(wavg, 'population','median_income').reset_index()
census_wavg = census_wavg.rename(columns={0: 'median_income'})
census_wavg

Unnamed: 0,state_name,median_income
0,Alabama,84219.71622
1,Alaska,28445.113534
2,Arizona,524846.75189
3,Arkansas,45096.516518
4,California,744266.958072
5,Colorado,99976.10444
6,Connecticut,448510.581719
7,Delaware,323252.049685
8,District of Columbia,648932.926287
9,Florida,319831.244261


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

# Selecting, Indexing, Filtering & Slicing

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
# Filtering

#### Loc is useful for Filtering data where the second line is good for selecting data

#### Look for individual rows using the loc function

In [219]:
sales = sales.set_index('Country')

In [220]:
sales.loc['Canada']

Unnamed: 0_level_0,Date,Day,Month,Year,Customer_Age,Age_Group,Customer_Gender,State,Product_Category,Sub_Category,Product,Order_Quantity,Unit_Cost,Unit_Price,Profit,Cost,Revenue,Revenue_per_Age,Calculated_Cost
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
Canada,2013-11-26,26,11,2013,19,Youth (<25),M,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950.0,50.000000,370.80
Canada,2015-11-26,26,11,2015,19,Youth (<25),M,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950.0,50.000000,370.80
Canada,2013-08-02,2,8,2013,29,Young Adults (25-34),M,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,5,45,120,369,225,594.0,20.482759,231.75
Canada,2015-08-02,2,8,2015,29,Young Adults (25-34),M,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,7,45,120,517,315,832.0,28.689655,324.45
Canada,2013-09-02,2,9,2013,29,Young Adults (25-34),M,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,2,45,120,148,90,238.0,8.206897,92.70
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Canada,2016-07-05,5,7,2016,38,Adults (35-64),M,British Columbia,Clothing,Vests,"Classic Vest, L",14,24,64,551,336,887.0,23.342105,346.08
Canada,2013-08-18,18,8,2013,31,Young Adults (25-34),F,British Columbia,Clothing,Vests,"Classic Vest, L",13,24,64,512,312,824.0,26.580645,321.36
Canada,2015-08-18,18,8,2015,31,Young Adults (25-34),F,British Columbia,Clothing,Vests,"Classic Vest, L",11,24,64,433,264,697.0,22.483871,271.92
Canada,2013-09-21,21,9,2013,31,Young Adults (25-34),F,British Columbia,Clothing,Vests,"Classic Vest, L",15,24,64,590,360,950.0,30.645161,370.80


#### select the last row by sequential position

In [221]:
sales.iloc[-1]

Date                2016-03-04 00:00:00
Day                                   4
Month                                 3
Year                               2016
Customer_Age                         37
Age_Group                Adults (35-64)
Customer_Gender                       F
State                     Seine (Paris)
Product_Category               Clothing
Sub_Category                      Vests
Product                 Classic Vest, L
Order_Quantity                       23
Unit_Cost                            24
Unit_Price                           64
Profit                              655
Cost                                552
Revenue                          1327.7
Revenue_per_Age               32.621622
Calculated_Cost                  568.56
Name: France, dtype: object

In [222]:
sales = sales.reset_index()
sales['Country']

0                 Canada
1                 Canada
2              Australia
3              Australia
4              Australia
               ...      
113031    United Kingdom
113032         Australia
113033         Australia
113034            France
113035            France
Name: Country, Length: 113036, dtype: object

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
# Filtering Multiple Value

In [223]:
geo_list = ['Canada','Australia','United States']
geo_filter = sales['Country'].isin(geo_list)
sales[geo_filter]

Unnamed: 0,Country,Date,Day,Month,Year,Customer_Age,Age_Group,Customer_Gender,State,Product_Category,Sub_Category,Product,Order_Quantity,Unit_Cost,Unit_Price,Profit,Cost,Revenue,Revenue_per_Age,Calculated_Cost
0,Canada,2013-11-26,26,11,2013,19,Youth (<25),M,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950.0,50.000000,370.80
1,Canada,2015-11-26,26,11,2015,19,Youth (<25),M,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950.0,50.000000,370.80
2,Australia,2014-03-23,23,3,2014,49,Adults (35-64),M,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,23,45,120,1366,1035,2401.0,49.000000,1066.05
3,Australia,2016-03-23,23,3,2016,49,Adults (35-64),M,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,20,45,120,1188,900,2088.0,42.612245,927.00
4,Australia,2014-05-15,15,5,2014,47,Adults (35-64),F,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,4,45,120,238,180,418.0,8.893617,185.40
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
113023,United States,2016-03-20,20,3,2016,34,Young Adults (25-34),M,California,Clothing,Vests,"Classic Vest, S",26,24,64,1007,624,1631.0,47.970588,642.72
113024,United States,2014-04-03,3,4,2014,34,Young Adults (25-34),M,California,Clothing,Vests,"Classic Vest, S",16,24,64,620,384,1004.0,29.529412,395.52
113025,United States,2016-04-03,3,4,2016,34,Young Adults (25-34),M,California,Clothing,Vests,"Classic Vest, S",14,24,64,542,336,878.0,25.823529,346.08
113032,Australia,2014-04-02,2,4,2014,18,Youth (<25),M,Queensland,Clothing,Vests,"Classic Vest, M",22,24,64,655,528,1183.0,65.722222,543.84


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
## Retrieving Single and Multiple columns

In [224]:
sales.loc[(sales['Month'] == 'November') &
          (sales['Year'] == 2013),
          ['Age_Group','Revenue']
]

Unnamed: 0,Age_Group,Revenue


In [225]:
sales.loc[sales['State'] == 'Kentucky', ['Age_Group','Revenue']]

Unnamed: 0,Age_Group,Revenue
156,Adults (35-64),108.0
157,Adults (35-64),108.0
23826,Adults (35-64),238.0
23827,Adults (35-64),277.0
31446,Adults (35-64),914.0
31447,Adults (35-64),977.0
79670,Adults (35-64),54.0
79671,Adults (35-64),567.0
79672,Adults (35-64),27.0
79673,Adults (35-64),486.0


In [226]:
country_stat.loc['France': 'Italy']

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe


In [227]:
country_stat.loc['France': 'Italy', 'Population']

France     63.951
Germany    80.940
Italy      60.665
Name: Population, dtype: float64

In [228]:
country_stat.loc['France': 'Italy', ['Population', 'GDP']]

Unnamed: 0,Population,GDP
France,63.951,2833687
Germany,80.94,3874437
Italy,60.665,2167744


In [229]:
sales.loc[sales['Age_Group'] == 'Adults (35-64)', 'Revenue'].head()

2    2401.0
3    2088.0
4     418.0
5     522.0
6     379.0
Name: Revenue, dtype: float64

### Get the mean revenue of the `Adults (35-64)` sales group

In [230]:
sales.loc[sales['Age_Group']=='Adults (35-64)', 'Revenue'].mean()

769.3162152479208

### How Many Records belong to Age Group `Youth (<25)` or `Adults 35-64`?

In [231]:
sales.loc[(sales['Age_Group'] == 'Youth (<25)') | (sales['Age_Group'] == 'Adults (35-64)')].shape[0]

73652

### Get the mean revenue of the sales group `Adults (35-64)` in `United States`

In [232]:
sales.loc[(sales['Age_Group'] == 'Adults (35-64)') & (sales['Country'] == 'United States'), 'Revenue'].mean()

726.7260473588342

### Increase the revenue by 10% to every sale made in France

In [233]:
sales.loc[sales['Country'] == 'France', 'Revenue'].head()

50     865.7
51     865.7
52    3252.7
53    3136.1
60     688.6
Name: Revenue, dtype: float64

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Dropping Data

In [234]:
country_stat.drop('Canada')

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [235]:
country_stat.drop(['Canada', 'Japan'])

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [236]:
country_stat.drop(columns=['Population', 'HDI'])

Unnamed: 0,GDP,Surface Area,Continent
Canada,1785387,9984670,America
France,2833687,640679,Europe
Germany,3874437,357114,Europe
Italy,2167744,301336,Europe
Japan,4602367,377930,Asia
United Kingdom,2950039,242495,Europe
United States,17348075,9525067,America


#### Drop rows using axis = 0 or rows

In [237]:
country_stat.drop(['Canada', 'Germany'], axis=0)

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [238]:
country_stat.drop(['Canada', 'Germany'], axis='rows')

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


#### Drop Columns using axis = 1 or columns

In [239]:
country_stat.drop(['Population', 'HDI'], axis=1)

Unnamed: 0,GDP,Surface Area,Continent
Canada,1785387,9984670,America
France,2833687,640679,Europe
Germany,3874437,357114,Europe
Italy,2167744,301336,Europe
Japan,4602367,377930,Asia
United Kingdom,2950039,242495,Europe
United States,17348075,9525067,America


In [240]:
country_stat.drop(['Population', 'HDI'], axis='columns')

Unnamed: 0,GDP,Surface Area,Continent
Canada,1785387,9984670,America
France,2833687,640679,Europe
Germany,3874437,357114,Europe
Italy,2167744,301336,Europe
Japan,4602367,377930,Asia
United Kingdom,2950039,242495,Europe
United States,17348075,9525067,America


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

# Modifying Dataframes

In [241]:
langs = pd.Series(
    ['French', 'German', 'Italian'],
    index=['France', 'Germany', 'Italy'],
    name='Language'
)

In [242]:
langs

France      French
Germany     German
Italy      Italian
Name: Language, dtype: object

In [243]:
country_stat['Language'] = langs
country_stat

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,
France,63.951,2833687,640679,0.888,Europe,French
Germany,80.94,3874437,357114,0.916,Europe,German
Italy,60.665,2167744,301336,0.873,Europe,Italian
Japan,127.061,4602367,377930,0.891,Asia,
United Kingdom,64.511,2950039,242495,0.907,Europe,
United States,318.523,17348075,9525067,0.915,America,


In [244]:
country_stat['Language'] = 'English'
country_stat

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,English
France,63.951,2833687,640679,0.888,Europe,English
Germany,80.94,3874437,357114,0.916,Europe,English
Italy,60.665,2167744,301336,0.873,Europe,English
Japan,127.061,4602367,377930,0.891,Asia,English
United Kingdom,64.511,2950039,242495,0.907,Europe,English
United States,318.523,17348075,9525067,0.915,America,English


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Renaming Columns

In [245]:
country_stat.rename(
    columns={
        'HDI': 'Human Development Index',
        'Anual Popcorn Consumption': 'APC'
    }, index={
        'United States': 'USA',
        'United Kingdom': 'UK',
        'Argentina': 'AR'
    })

Unnamed: 0,Population,GDP,Surface Area,Human Development Index,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,English
France,63.951,2833687,640679,0.888,Europe,English
Germany,80.94,3874437,357114,0.916,Europe,English
Italy,60.665,2167744,301336,0.873,Europe,English
Japan,127.061,4602367,377930,0.891,Asia,English
UK,64.511,2950039,242495,0.907,Europe,English
USA,318.523,17348075,9525067,0.915,America,English


In [246]:
country_stat.rename(index=str.upper)

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
CANADA,35.467,1785387,9984670,0.913,America,English
FRANCE,63.951,2833687,640679,0.888,Europe,English
GERMANY,80.94,3874437,357114,0.916,Europe,English
ITALY,60.665,2167744,301336,0.873,Europe,English
JAPAN,127.061,4602367,377930,0.891,Asia,English
UNITED KINGDOM,64.511,2950039,242495,0.907,Europe,English
UNITED STATES,318.523,17348075,9525067,0.915,America,English


In [247]:
country_stat.rename(index=str.lower)

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
canada,35.467,1785387,9984670,0.913,America,English
france,63.951,2833687,640679,0.888,Europe,English
germany,80.94,3874437,357114,0.916,Europe,English
italy,60.665,2167744,301336,0.873,Europe,English
japan,127.061,4602367,377930,0.891,Asia,English
united kingdom,64.511,2950039,242495,0.907,Europe,English
united states,318.523,17348075,9525067,0.915,America,English


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

# Appending Values

In [248]:
country_stat.append(pd.Series({
    'Population': 3,
    'GDP': 5
}, name='China'))

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387.0,9984670.0,0.913,America,English
France,63.951,2833687.0,640679.0,0.888,Europe,English
Germany,80.94,3874437.0,357114.0,0.916,Europe,English
Italy,60.665,2167744.0,301336.0,0.873,Europe,English
Japan,127.061,4602367.0,377930.0,0.891,Asia,English
United Kingdom,64.511,2950039.0,242495.0,0.907,Europe,English
United States,318.523,17348075.0,9525067.0,0.915,America,English
China,3.0,5.0,,,,


In [249]:
country_stat.loc['China'] = pd.Series({'Population': 1_400_000_000, 'Continent': 'Asia'})
country_stat

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387.0,9984670.0,0.913,America,English
France,63.951,2833687.0,640679.0,0.888,Europe,English
Germany,80.94,3874437.0,357114.0,0.916,Europe,English
Italy,60.665,2167744.0,301336.0,0.873,Europe,English
Japan,127.061,4602367.0,377930.0,0.891,Asia,English
United Kingdom,64.511,2950039.0,242495.0,0.907,Europe,English
United States,318.523,17348075.0,9525067.0,0.915,America,English
China,1400000000.0,,,,Asia,


In [250]:
country_stat.reset_index()

Unnamed: 0,index,Population,GDP,Surface Area,HDI,Continent,Language
0,Canada,35.467,1785387.0,9984670.0,0.913,America,English
1,France,63.951,2833687.0,640679.0,0.888,Europe,English
2,Germany,80.94,3874437.0,357114.0,0.916,Europe,English
3,Italy,60.665,2167744.0,301336.0,0.873,Europe,English
4,Japan,127.061,4602367.0,377930.0,0.891,Asia,English
5,United Kingdom,64.511,2950039.0,242495.0,0.907,Europe,English
6,United States,318.523,17348075.0,9525067.0,0.915,America,English
7,China,1400000000.0,,,,Asia,


In [251]:
country_stat.set_index('Population')

Unnamed: 0_level_0,GDP,Surface Area,HDI,Continent,Language
Population,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
35.467,1785387.0,9984670.0,0.913,America,English
63.951,2833687.0,640679.0,0.888,Europe,English
80.94,3874437.0,357114.0,0.916,Europe,English
60.665,2167744.0,301336.0,0.873,Europe,English
127.061,4602367.0,377930.0,0.891,Asia,English
64.511,2950039.0,242495.0,0.907,Europe,English
318.523,17348075.0,9525067.0,0.915,America,English
1400000000.0,,,,Asia,


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Cleaning Data

In [252]:
pd.isnull(np.nan)

True

In [253]:
pd.isnull(None)

True

In [254]:
pd.isna(np.nan)

True

In [255]:
pd.isna(None)

True

In [256]:
pd.notnull(None)

False

In [257]:
pd.isnull(pd.DataFrame({
    'Column A': [1, np.nan, 7],
    'Column B': [np.nan, 2, 3],
    'Column C': [np.nan, 2, np.nan]
}))

Unnamed: 0,Column A,Column B,Column C
0,False,True,True
1,True,False,False
2,False,False,True


In [258]:
pd.Series([1, 2, np.nan]).count()

2

In [259]:
pd.Series([1, 2, np.nan]).sum()

3.0

In [260]:
pd.Series([2, 2, np.nan]).mean()

2.0

In [261]:
s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

In [262]:
pd.notnull(s)

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

In [263]:
pd.isnull(s)

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

In [264]:
s.isnull()

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

In [265]:
s.notnull()

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

In [266]:
s[s.notnull()]

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64

In [267]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

In [268]:
s.dropna()

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64

In [269]:
df_nulls = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

df_nulls

Unnamed: 0,Column A,Column B,Column C,Column D
0,1.0,2.0,,5
1,,8.0,9.0,8
2,30.0,31.0,32.0,34
3,,,100.0,110


In [270]:
df_nulls.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Column A  2 non-null      float64
 1   Column B  3 non-null      float64
 2   Column C  3 non-null      float64
 3   Column D  4 non-null      int64  
dtypes: float64(3), int64(1)
memory usage: 256.0 bytes


In [271]:
df_nulls.isnull().sum()

Column A    2
Column B    1
Column C    1
Column D    0
dtype: int64

In [272]:
df_nulls.dropna()

Unnamed: 0,Column A,Column B,Column C,Column D
2,30.0,31.0,32.0,34


In [273]:
df_nulls.dropna(axis=1)

Unnamed: 0,Column D
0,5
1,8
2,34
3,110


In this case, any row or column that contains **at least** one null value will be dropped. Which can be, depending on the case, too extreme. You can control this behavior with the `how` parameter. Can be either `'any'` or `'all'`:

In [274]:
df_nulls2 = pd.DataFrame({
    'Column A': [1, np.nan, 30],
    'Column B': [2, np.nan, 31],
    'Column C': [np.nan, np.nan, 100]
})
df_nulls2

Unnamed: 0,Column A,Column B,Column C
0,1.0,2.0,
1,,,
2,30.0,31.0,100.0


In [275]:
df_nulls2.dropna(how='all') # if all columns have NA then drop

Unnamed: 0,Column A,Column B,Column C
0,1.0,2.0,
2,30.0,31.0,100.0


In [276]:
df_nulls2.dropna(how = 'any') # default behavior

Unnamed: 0,Column A,Column B,Column C
2,30.0,31.0,100.0


In [277]:
df_nulls2.dropna(thresh=3) # if at least 3 values are NA

Unnamed: 0,Column A,Column B,Column C
2,30.0,31.0,100.0


In [278]:
df_nulls2.dropna(thresh=3, axis='columns')

0
1
2


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Filling Null values on Dataframes

Filling with preset numbers or the statistical measure

In [279]:
df_nulls2.fillna({'Column A': 0, 'Column B': 99, 'Column C': df_nulls['Column C'].mean()})

Unnamed: 0,Column A,Column B,Column C
0,1.0,2.0,47.0
1,0.0,99.0,47.0
2,30.0,31.0,100.0


Forwards filling or backward filling

In [280]:
df_nulls2.fillna(method='ffill', axis=0)

Unnamed: 0,Column A,Column B,Column C
0,1.0,2.0,
1,1.0,2.0,
2,30.0,31.0,100.0


In [281]:
df_nulls2.fillna(method='bfill', axis=0)

Unnamed: 0,Column A,Column B,Column C
0,1.0,2.0,100.0
1,30.0,31.0,100.0
2,30.0,31.0,100.0


In [282]:
df_nulls2.dropna().count()

Column A    1
Column B    1
Column C    1
dtype: int64

In [283]:
missing_values = len(s.dropna() != len(s))
missing_values

4

In [284]:
len(s)

6

In [285]:
s.count()

4

**More Pythonic solution `any`**
The methods `any` and `all` check if either there's `any` True value in a Series or `all` the values are `True`. They work in the same way as in Python:

In [286]:
pd.Series([True, False, False]).any()

True

In [287]:
pd.Series([True, False, False]).all()

False

In [288]:
pd.Series([True, True, True]).all()

True

In [289]:
s.isnull()

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

# Cleaning All Values

The previous `DataFrame` doesn't have any "missing value", but clearly has invalid data. `290` doesn't seem like a valid age, and `D` and `?` don't correspond with any known sex category. How can you clean these not-missing, but clearly invalid values then?

In [290]:
gender = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'D', '?'],
    'Age': [29, 30, 24, 290, 25],
})
gender

Unnamed: 0,Sex,Age
0,M,29
1,F,30
2,F,24
3,D,290
4,?,25


In [291]:
gender['Sex'].replace('D','F')

0    M
1    F
2    F
3    F
4    ?
Name: Sex, dtype: object

It can accept a dictionary of values to replace. For example, they also told you that there might be a few `'N's`, that should actually be `'M's`:

In [292]:
gender['Sex'].replace({'D': 'F', 'N': 'M'})

0    M
1    F
2    F
3    F
4    ?
Name: Sex, dtype: object

If you have many columns to replace, you could apply it at “DataFrame level”:

In [293]:
gender.replace({
    'Sex': {
        'D': 'F',
        'N': 'M'
    },
    'Age': {
        290: 29
    }
})

Unnamed: 0,Sex,Age
0,M,29
1,F,30
2,F,24
3,F,29
4,?,25


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Dealing with Duplicates

## The sales dataset contains information about daily store purchases. What is the correct way to return unique entries for each product?

In [294]:
sales_datacamp = pd.DataFrame(data={'date':['2018-01-15','2018-01-15','2018-01-16','2018-01-16','2018-01-17'],
                           'product_line':['Health and beauty', 'Electronic accessories', 'Home and lifestyle', 'Sports', 'Food and beverages'],
                           'product': ['Shampoo', 'Headphones', 'Lamp', 'Yoga mat', 'Milk'],
                           'unit_price': [6.99,25.38,46.33,39.99,5.99],
                           'quantity': [7,5,3,5,8]
                           })
sales_datacamp

Unnamed: 0,date,product_line,product,unit_price,quantity
0,2018-01-15,Health and beauty,Shampoo,6.99,7
1,2018-01-15,Electronic accessories,Headphones,25.38,5
2,2018-01-16,Home and lifestyle,Lamp,46.33,3
3,2018-01-16,Sports,Yoga mat,39.99,5
4,2018-01-17,Food and beverages,Milk,5.99,8


In [295]:
sales_datacamp.drop_duplicates(subset='product')

Unnamed: 0,date,product_line,product,unit_price,quantity
0,2018-01-15,Health and beauty,Shampoo,6.99,7
1,2018-01-15,Electronic accessories,Headphones,25.38,5
2,2018-01-16,Home and lifestyle,Lamp,46.33,3
3,2018-01-16,Sports,Yoga mat,39.99,5
4,2018-01-17,Food and beverages,Milk,5.99,8


In [296]:
ambassadors = pd.Series([
    'France',
    'United Kingdom',
    'United Kingdom',
    'Italy',
    'Germany',
    'Germany',
    'Germany',
], index=[
    'Gérard Araud',
    'Kim Darroch',
    'Peter Westmacott',
    'Armando Varricchio',
    'Peter Wittig',
    'Peter Ammon',
    'Klaus Scharioth '
])

In [297]:
ambassadors

Gérard Araud                  France
Kim Darroch           United Kingdom
Peter Westmacott      United Kingdom
Armando Varricchio             Italy
Peter Wittig                 Germany
Peter Ammon                  Germany
Klaus Scharioth              Germany
dtype: object

In [298]:
ambassadors.duplicated()

Gérard Araud          False
Kim Darroch           False
Peter Westmacott       True
Armando Varricchio    False
Peter Wittig          False
Peter Ammon            True
Klaus Scharioth        True
dtype: bool

In this case `duplicated` didn't consider `'Kim Darroch'`, the first instance of the United Kingdom or `'Peter Wittig'` as duplicates. That's because, by default, it'll consider the first occurrence of the value as not-duplicate. You can change this behavior with the `keep` parameter:

In [299]:
ambassadors.duplicated(keep='last')

Gérard Araud          False
Kim Darroch            True
Peter Westmacott      False
Armando Varricchio    False
Peter Wittig           True
Peter Ammon            True
Klaus Scharioth       False
dtype: bool

In this case, the result is "flipped", `'Kim Darroch'` and `'Peter Wittig'` (the first ambassadors of their countries) are considered duplicates, but `'Peter Westmacott'` and `'Klaus Scharioth'` are not duplicates. You can also choose to mark all of them as duplicates with `keep=False`:

In [300]:
ambassadors.duplicated(keep=False)

Gérard Araud          False
Kim Darroch            True
Peter Westmacott       True
Armando Varricchio    False
Peter Wittig           True
Peter Ammon            True
Klaus Scharioth        True
dtype: bool

A similar method is `drop_duplicates`, which just excludes the duplicated values and also accepts the `keep` parameter:

In [301]:
ambassadors.drop_duplicates(keep='last')

Gérard Araud                  France
Peter Westmacott      United Kingdom
Armando Varricchio             Italy
Klaus Scharioth              Germany
dtype: object

In [302]:
ambassadors.drop_duplicates(keep=False)

Gérard Araud          France
Armando Varricchio     Italy
dtype: object

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Duplicates in DataFrames

Conceptually speaking, duplicates in a DataFrame happen at "row" level. Two rows with exactly the same values are considered to be duplicates:

In [303]:
players = pd.DataFrame({
    'Name': [
        'Kobe Bryant',
        'LeBron James',
        'Kobe Bryant',
        'Carmelo Anthony',
        'Kobe Bryant',
    ],
    'Pos': [
        'SG',
        'SF',
        'SG',
        'SF',
        'SF'
    ]
})

In [304]:
players

Unnamed: 0,Name,Pos
0,Kobe Bryant,SG
1,LeBron James,SF
2,Kobe Bryant,SG
3,Carmelo Anthony,SF
4,Kobe Bryant,SF


In [305]:
players.duplicated()

0    False
1    False
2     True
3    False
4    False
dtype: bool

In [306]:
players.duplicated(subset=['Name'])

0    False
1    False
2     True
3    False
4     True
dtype: bool

In [307]:
players.duplicated(subset=['Name'], keep='last')

0     True
1    False
2     True
3    False
4    False
dtype: bool

In [308]:
players.drop_duplicates()

Unnamed: 0,Name,Pos
0,Kobe Bryant,SG
1,LeBron James,SF
3,Carmelo Anthony,SF
4,Kobe Bryant,SF


In [309]:
players.drop_duplicates(subset=['Name'])

Unnamed: 0,Name,Pos
0,Kobe Bryant,SG
1,LeBron James,SF
3,Carmelo Anthony,SF


In [310]:
players.drop_duplicates(subset=['Name'], keep='last')

Unnamed: 0,Name,Pos
1,LeBron James,SF
3,Carmelo Anthony,SF
4,Kobe Bryant,SF


In [311]:
sales['Year_Char'] = sales['Year'].astype(str)
sales.dtypes

Country                     object
Date                datetime64[ns]
Day                          int64
Month                        int64
Year                         int64
Customer_Age                 int64
Age_Group                   object
Customer_Gender             object
State                       object
Product_Category            object
Sub_Category                object
Product                     object
Order_Quantity               int64
Unit_Cost                    int64
Unit_Price                   int64
Profit                       int64
Cost                         int64
Revenue                    float64
Revenue_per_Age            float64
Calculated_Cost            float64
Year_Char                   object
dtype: object

## Encode `Product` as a `category` type. The columns of data frame `df` now have the following types:

In [312]:
sales['Product'] = sales.Product.astype('category')
sales.dtypes

Country                     object
Date                datetime64[ns]
Day                          int64
Month                        int64
Year                         int64
Customer_Age                 int64
Age_Group                   object
Customer_Gender             object
State                       object
Product_Category            object
Sub_Category                object
Product                   category
Order_Quantity               int64
Unit_Cost                    int64
Unit_Price                   int64
Profit                       int64
Cost                         int64
Revenue                    float64
Revenue_per_Age            float64
Calculated_Cost            float64
Year_Char                   object
dtype: object

# Joining Data

In [313]:
census_state = census_county.groupby('state_name').agg({'population':np.sum}).reset_index()

In [314]:
census_state = pd.merge(census_wavg,
                        census_state,
                           how="left",
                           left_on = ["state_name"],
                           right_on = ["state_name"])
census_state

Unnamed: 0,state_name,median_income,population
0,Alabama,84219.71622,43405190
1,Alaska,28445.113534,6544837
2,Arizona,524846.75189,59967006
3,Arkansas,45096.516518,26587370
4,California,744266.958072,344511143
5,Colorado,99976.10444,47542033
6,Connecticut,448510.581719,32238847
7,Delaware,323252.049685,8328117
8,District of Columbia,648932.926287,5808886
9,Florida,319831.244261,177385332


# Case When np.where

## Single Condition

In [318]:
census_state["is_arizona"] = np.where(census_state["state_name"] == 'Arizona', True, False).copy()
census_state

Unnamed: 0,state_name,median_income,population,sun_belt,is_arizona
0,Alabama,84219.71622,43405190,False,False
1,Alaska,28445.113534,6544837,False,False
2,Arizona,524846.75189,59967006,True,True
3,Arkansas,45096.516518,26587370,False,False
4,California,744266.958072,344511143,False,False
5,Colorado,99976.10444,47542033,True,False
6,Connecticut,448510.581719,32238847,False,False
7,Delaware,323252.049685,8328117,False,False
8,District of Columbia,648932.926287,5808886,False,False
9,Florida,319831.244261,177385332,False,False


## Using A list

In [319]:
geo_list =  ['Arizona','Nevada','New Mexico','Utah','Texas','Colorado']
census_state["sun_belt"] = np.where(census_state["state_name"].isin(geo_list), True, False).copy()
census_state

Unnamed: 0,state_name,median_income,population,sun_belt,is_arizona
0,Alabama,84219.71622,43405190,False,False
1,Alaska,28445.113534,6544837,False,False
2,Arizona,524846.75189,59967006,True,True
3,Arkansas,45096.516518,26587370,False,False
4,California,744266.958072,344511143,False,False
5,Colorado,99976.10444,47542033,True,False
6,Connecticut,448510.581719,32238847,False,False
7,Delaware,323252.049685,8328117,False,False
8,District of Columbia,648932.926287,5808886,False,False
9,Florida,319831.244261,177385332,False,False


## Multiple Condition

In [322]:
census_state["sun_belt"] = np.where((census_state["state_name"] == 'Arizona') |
                                    (census_state["state_name"] == 'Nevada') |
                                    (census_state["state_name"] == 'New Mexico'), True, False).copy()
census_state

Unnamed: 0,state_name,median_income,population,sun_belt,is_arizona
0,Alabama,84219.71622,43405190,False,False
1,Alaska,28445.113534,6544837,False,False
2,Arizona,524846.75189,59967006,True,True
3,Arkansas,45096.516518,26587370,False,False
4,California,744266.958072,344511143,False,False
5,Colorado,99976.10444,47542033,False,False
6,Connecticut,448510.581719,32238847,False,False
7,Delaware,323252.049685,8328117,False,False
8,District of Columbia,648932.926287,5808886,False,False
9,Florida,319831.244261,177385332,False,False
