![image.png](https://raw.githubusercontent.com/fjvarasc/DSPXI/master/figures/python_logo.png)

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Common-Excel-Tasks-Demonstrated-in-Pandas" data-toc-modified-id="Common-Excel-Tasks-Demonstrated-in-Pandas-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Common Excel Tasks Demonstrated in Pandas</a></span><ul class="toc-item"><li><span><a href="#Adding-a-Sum-to-a-Row" data-toc-modified-id="Adding-a-Sum-to-a-Row-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Adding a Sum to a Row</a></span></li><li><span><a href="#Additional-Data-Transforms" data-toc-modified-id="Additional-Data-Transforms-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Additional Data Transforms</a></span></li><li><span><a href="#Subtotals" data-toc-modified-id="Subtotals-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Subtotals</a></span></li><li><span><a href="#Filtering-the-data" data-toc-modified-id="Filtering-the-data-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Filtering the data</a></span></li><li><span><a href="#Working-with-Dates" data-toc-modified-id="Working-with-Dates-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>Working with Dates</a></span></li><li><span><a href="#Additional-String-Functions" data-toc-modified-id="Additional-String-Functions-4.6"><span class="toc-item-num">4.6&nbsp;&nbsp;</span>Additional String Functions</a></span></li><li><span><a href="#Bonus-Task" data-toc-modified-id="Bonus-Task-4.7"><span class="toc-item-num">4.7&nbsp;&nbsp;</span>Bonus Task</a></span></li></ul></li></ul></div>

# Common Excel Tasks Demonstrated in Pandas
The purpose of this article is to show some common Excel tasks and how you would execute similar tasks in pandas. Some of the examples are somewhat trivial, but it is important to show the simple functions as well as the more complex ones you can find elsewhere. As an added bonus, we are going to do some fuzzy string matching to add a little twist to the process and show how pandas can use the full Python ecosystem of modules to do something simply in Python that would be complex in Excel.

## Adding a Sum to a Row
The first task I’ll cover is summing some columns to add a total column.

In [155]:
import pandas as pd
import numpy as np
df = pd.read_excel("https://github.com/fjvarasc/DSPXI/blob/master/data/excel-comp-data.xlsx?raw=true")
df.head()

Unnamed: 0,account,name,street,city,state,postal-code,Jan,Feb,Mar
0,211829,"Kerluke, Koepp and Hilpert",34456 Sean Highway,New Jaycob,Texas,28752,10000,62000,35000
1,320563,Walter-Trantow,1311 Alvis Tunnel,Port Khadijah,NorthCarolina,38365,95000,45000,35000
2,648336,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517,91000,120000,35000
3,109996,"D'Amore, Gleichner and Bode",155 Fadel Crescent Apt. 144,Hyattburgh,Maine,46021,45000,120000,10000
4,121213,Bauch-Goldner,7274 Marissa Common,Shanahanchester,California,49681,162000,120000,35000


We want to add a total column to show total sales for Jan, Feb and Mar.

This is straightforward in Excel and in pandas. For Excel, I have added the formula `sum(G2:I2)` in column J. Here is what it looks like in Excel:

![image.png](https://raw.githubusercontent.com/fjvarasc/DSPXI/master/figures/excel-1.png)

Next, here is how we do it in pandas:

In [113]:
df["total"] = df["Jan"] + df["Feb"] + df["Mar"]
df.head()

Unnamed: 0,account,name,street,city,state,postal-code,Jan,Feb,Mar,total
0,211829,"Kerluke, Koepp and Hilpert",34456 Sean Highway,New Jaycob,Texas,28752,10000,62000,35000,107000
1,320563,Walter-Trantow,1311 Alvis Tunnel,Port Khadijah,NorthCarolina,38365,95000,45000,35000,175000
2,648336,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517,91000,120000,35000,246000
3,109996,"D'Amore, Gleichner and Bode",155 Fadel Crescent Apt. 144,Hyattburgh,Maine,46021,45000,120000,10000,175000
4,121213,Bauch-Goldner,7274 Marissa Common,Shanahanchester,California,49681,162000,120000,35000,317000
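Adding the columns one by one works well for three months, but it gets tedious for twelve. A minimal sketch of an alternative, assuming a small toy frame (`sales` and `month_cols` are names made up for this example), sums across the columns of each row with `sum(axis=1)`:

```python
import pandas as pd

# Toy stand-in for the sales data; the values mirror the first two rows above.
sales = pd.DataFrame(
    {"Jan": [10000, 95000], "Feb": [62000, 45000], "Mar": [35000, 35000]}
)

# Sum across the listed columns of each row instead of adding them one by one.
month_cols = ["Jan", "Feb", "Mar"]
sales["total"] = sales[month_cols].sum(axis=1)
print(sales)
```

With this form, adding another month only means extending the list of column names.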


Next, let’s get some totals and other values for each month. Here is what we are trying to do as shown in Excel:

![image.png](https://raw.githubusercontent.com/fjvarasc/DSPXI/master/figures/excel-2.png)

As you can see, we added a `SUM(G2:G16)` in row 17 in each of the columns to get totals by month.

Performing column level analysis is easy in pandas. Here are a couple of examples.

In [60]:
df["Jan"].sum(), df["Jan"].mean(),df["Jan"].min(),df["Jan"].max()

(1462000, 97466.66666666667, 10000, 162000)
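If you want the same statistics for several columns at once, `agg` accepts a list of function names. The following is only a sketch on a toy frame (`sales` and `summary` are illustrative names, not part of the original data):

```python
import pandas as pd

sales = pd.DataFrame(
    {"Jan": [10000, 95000, 91000], "Feb": [62000, 45000, 120000]}
)

# One row per statistic, one column per month.
summary = sales.agg(["sum", "mean", "min", "max"])
print(summary)
```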

Now, we want to add a total by month and a grand total. This is where pandas and Excel diverge a little. It is very simple to add totals in cells in Excel for each month. Because pandas needs to maintain the integrity of the entire DataFrame, there are a couple more steps.

First, create a sum for the month and total columns.

In [61]:
sum_row=df[["Jan","Feb","Mar","total"]].sum()
sum_row

Jan      1462000
Feb      1507000
Mar       717000
total    3686000
dtype: int64

This is fairly intuitive. However, if you want to add totals as a row, you need to do some minor manipulations.

We need to transpose the data and convert the Series to a DataFrame so that it is easier to concat onto our existing data. The `T` attribute lets us switch the data from being row-based to column-based.

In [62]:
df_sum=pd.DataFrame(data=sum_row).T
df_sum

Unnamed: 0,Jan,Feb,Mar,total
0,1462000,1507000,717000,3686000


The final thing we need to do before adding the totals back is to add the missing columns. We use `reindex` to do this for us. The trick is to add all of our columns and then allow pandas to fill in the values that are missing.

In [63]:
df_sum=df_sum.reindex(columns=df.columns)
df_sum

Unnamed: 0,account,name,street,city,state,postal-code,Jan,Feb,Mar,total
0,,,,,,,1462000,1507000,717000,3686000


Now that we have a nicely formatted DataFrame, we can add it to our existing one using `pd.concat` (the older `DataFrame.append` method was removed in pandas 2.0).

In [64]:
df_final=pd.concat([df,df_sum],ignore_index=True)
df_final.tail()
df_final.tail()

Unnamed: 0,account,name,street,city,state,postal-code,Jan,Feb,Mar,total
11,231907.0,Hahn-Moore,18115 Olivine Throughway,Norbertomouth,NorthDakota,31415.0,150000,10000,162000,322000
12,242368.0,"Frami, Anderson and Donnelly",182 Bertie Road,East Davian,Iowa,72686.0,162000,120000,35000,317000
13,268755.0,Walsh-Haley,2624 Beatty Parkways,Goodwinmouth,RhodeIsland,31919.0,55000,120000,35000,210000
14,273274.0,McDermott PLC,8917 Bergstrom Meadow,Kathryneborough,Delaware,27933.0,150000,120000,70000,340000
15,,,,,,,1462000,1507000,717000,3686000
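The steps above can be sketched end to end on a toy frame. The data and the name `df_toy` are made up for illustration, and the last step uses `pd.concat`, which is available in every pandas version I am aware of:

```python
import pandas as pd

df_toy = pd.DataFrame(
    {"name": ["Acme", "Globex"], "Jan": [100, 200], "Feb": [50, 25]}
)

# 1. Sum only the numeric columns; the result is a Series.
sum_row = df_toy[["Jan", "Feb"]].sum()

# 2. Turn the Series into a one-row DataFrame.
df_sum = pd.DataFrame(data=sum_row).T

# 3. Add the missing columns (filled with NaN) in the original order.
df_sum = df_sum.reindex(columns=df_toy.columns)

# 4. Stack the totals row under the data.
df_final = pd.concat([df_toy, df_sum], ignore_index=True)
print(df_final)
```

Note that the totals row has no value for the non-numeric columns, which is why the real output above shows `account` and `postal-code` as floats: a column that contains NaN can no longer be stored as integers.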


## Additional Data Transforms
For another example, let’s try to add a state abbreviation to the data set.

From an Excel perspective the easiest way is probably to add a new column, do a vlookup on the state name and fill in the abbreviation.

I did this and here is a snapshot of what the results look like:

![image.png](https://raw.githubusercontent.com/fjvarasc/DSPXI/master/figures/excel-3.png)


You’ll notice that after performing the vlookup, there are some values that are not coming through correctly. That’s because we misspelled some of the states. Handling this in Excel would be really challenging (on big data sets).
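For comparison, the closest pandas equivalent of an exact-match vlookup is probably `Series.map` with a dictionary. This sketch uses a tiny made-up lookup table (`abbrev_by_state` is a hypothetical name) and shows the same failure on a misspelled state:

```python
import pandas as pd

# Small lookup table used only for this sketch.
abbrev_by_state = {"Texas": "TX", "Iowa": "IA", "North Carolina": "NC"}

states = pd.Series(["Texas", "NorthCarolina", "Iowa"])

# map() performs an exact-match lookup; names without a match become NaN.
abbrev = states.map(abbrev_by_state)
print(abbrev)
```

"NorthCarolina" has no exact match, so it comes back as NaN, just like the failed vlookup cells.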

Fortunately with pandas we have the full power of the python ecosystem at our disposal. In thinking about how to solve this type of messy data problem, I thought about trying to do some fuzzy text matching to determine the correct value.

Fortunately someone else has done a lot of work in this area. The [fuzzy wuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) library has some pretty useful functions for this type of situation.

Get started by importing the appropriate fuzzywuzzy functions and defining our state map dictionary.

In [65]:
"""
state_names returns a state name for a state code, like 'AK': 'Alaska'
country_names returns a country name for a country code, like 'AD': 'Andorra'
state_codes and country_codes are just the reverse
"""

state_names = {
 'AK': 'Alaska',
 'AL': 'Alabama',
 'AR': 'Arkansas',
 'AS': 'American Samoa',
 'AZ': 'Arizona',
 'CA': 'California',
 'CO': 'Colorado',
 'CT': 'Connecticut',
 'DC': 'District of Columbia',
 'DE': 'Delaware',
 'FL': 'Florida',
 'GA': 'Georgia',
 'GU': 'Guam',
 'HI': 'Hawaii',
 'IA': 'Iowa',
 'ID': 'Idaho',
 'IL': 'Illinois',
 'IN': 'Indiana',
 'KS': 'Kansas',
 'KY': 'Kentucky',
 'LA': 'Louisiana',
 'MA': 'Massachusetts',
 'MD': 'Maryland',
 'ME': 'Maine',
 'MI': 'Michigan',
 'MN': 'Minnesota',
 'MO': 'Missouri',
 'MP': 'Northern Mariana Islands',
 'MS': 'Mississippi',
 'MT': 'Montana',
 'NC': 'North Carolina',
 'ND': 'North Dakota',
 'NE': 'Nebraska',
 'NH': 'New Hampshire',
 'NJ': 'New Jersey',
 'NM': 'New Mexico',
 'NV': 'Nevada',
 'NY': 'New York',
 'OH': 'Ohio',
 'OK': 'Oklahoma',
 'OR': 'Oregon',
 'PA': 'Pennsylvania',
 'PR': 'Puerto Rico',
 'RI': 'Rhode Island',
 'SC': 'South Carolina',
 'SD': 'South Dakota',
 'TN': 'Tennessee',
 'TX': 'Texas',
 'UT': 'Utah',
 'VA': 'Virginia',
 'VI': 'Virgin Islands',
 'VT': 'Vermont',
 'WA': 'Washington',
 'WI': 'Wisconsin',
 'WV': 'West Virginia',
 'WY': 'Wyoming',
 }

country_names = {
 'AD': 'Andorra',
 'AE': 'United Arab Emirates',
 'AF': 'Afghanistan',
 'AG': 'Antigua and Barbuda',
 'AI': 'Anguilla',
 'AL': 'Albania',
 'AM': 'Armenia',
 'AN': 'Netherlands Antilles',
 'AO': 'Angola',
 'AQ': 'Antarctica',
 'AR': 'Argentina',
 'AS': 'American Samoa',
 'AT': 'Austria',
 'AU': 'Australia',
 'AW': 'Aruba',
 'AX': 'Åland Islands',
 'AZ': 'Azerbaijan',
 'BA': 'Bosnia and Herzegovina',
 'BB': 'Barbados',
 'BD': 'Bangladesh',
 'BE': 'Belgium',
 'BF': 'Burkina Faso',
 'BG': 'Bulgaria',
 'BH': 'Bahrain',
 'BI': 'Burundi',
 'BJ': 'Benin',
 'BL': 'Saint Barthélemy',
 'BM': 'Bermuda',
 'BN': 'Brunei Darussalam',
 'BO': 'Bolivia, Plurinational State of',
 'BR': 'Brazil',
 'BS': 'Bahamas',
 'BT': 'Bhutan',
 'BV': 'Bouvet Island',
 'BW': 'Botswana',
 'BY': 'Belarus',
 'BZ': 'Belize',
 'CA': 'Canada',
 'CC': 'Cocos (Keeling) Islands',
 'CD': 'Congo, The Democratic Republic of the',
 'CF': 'Central African Republic',
 'CG': 'Congo',
 'CH': 'Switzerland',
 'CI': "Côte d'Ivoire",
 'CK': 'Cook Islands',
 'CL': 'Chile',
 'CM': 'Cameroon',
 'CN': 'China',
 'CO': 'Colombia',
 'CR': 'Costa Rica',
 'CU': 'Cuba',
 'CV': 'Cape Verde',
 'CX': 'Christmas Island',
 'CY': 'Cyprus',
 'CZ': 'Czech Republic',
 'DE': 'Germany',
 'DJ': 'Djibouti',
 'DK': 'Denmark',
 'DM': 'Dominica',
 'DO': 'Dominican Republic',
 'DZ': 'Algeria',
 'EC': 'Ecuador',
 'EE': 'Estonia',
 'EG': 'Egypt',
 'EH': 'Western Sahara',
 'ER': 'Eritrea',
 'ES': 'Spain',
 'ET': 'Ethiopia',
 'FI': 'Finland',
 'FJ': 'Fiji',
 'FK': 'Falkland Islands (Malvinas)',
 'FM': 'Micronesia, Federated States of',
 'FO': 'Faroe Islands',
 'FR': 'France',
 'GA': 'Gabon',
 'GB': 'United Kingdom',
 'GD': 'Grenada',
 'GE': 'Georgia',
 'GF': 'French Guiana',
 'GG': 'Guernsey',
 'GH': 'Ghana',
 'GI': 'Gibraltar',
 'GL': 'Greenland',
 'GM': 'Gambia',
 'GN': 'Guinea',
 'GP': 'Guadeloupe',
 'GQ': 'Equatorial Guinea',
 'GR': 'Greece',
 'GS': 'South Georgia and the South Sandwich Islands',
 'GT': 'Guatemala',
 'GU': 'Guam',
 'GW': 'Guinea-Bissau',
 'GY': 'Guyana',
 'HK': 'Hong Kong',
 'HM': 'Heard Island and McDonald Islands',
 'HN': 'Honduras',
 'HR': 'Croatia',
 'HT': 'Haiti',
 'HU': 'Hungary',
 'ID': 'Indonesia',
 'IE': 'Ireland',
 'IL': 'Israel',
 'IM': 'Isle of Man',
 'IN': 'India',
 'IO': 'British Indian Ocean Territory',
 'IQ': 'Iraq',
 'IR': 'Iran, Islamic Republic of',
 'IS': 'Iceland',
 'IT': 'Italy',
 'JE': 'Jersey',
 'JM': 'Jamaica',
 'JO': 'Jordan',
 'JP': 'Japan',
 'KE': 'Kenya',
 'KG': 'Kyrgyzstan',
 'KH': 'Cambodia',
 'KI': 'Kiribati',
 'KM': 'Comoros',
 'KN': 'Saint Kitts and Nevis',
 'KP': "Korea, Democratic People's Republic of",
 'KR': 'Korea, Republic of',
 'KW': 'Kuwait',
 'KY': 'Cayman Islands',
 'KZ': 'Kazakhstan',
 'LA': "Lao People's Democratic Republic",
 'LB': 'Lebanon',
 'LC': 'Saint Lucia',
 'LI': 'Liechtenstein',
 'LK': 'Sri Lanka',
 'LR': 'Liberia',
 'LS': 'Lesotho',
 'LT': 'Lithuania',
 'LU': 'Luxembourg',
 'LV': 'Latvia',
 'LY': 'Libyan Arab Jamahiriya',
 'MA': 'Morocco',
 'MC': 'Monaco',
 'MD': 'Moldova, Republic of',
 'ME': 'Montenegro',
 'MF': 'Saint Martin',
 'MG': 'Madagascar',
 'MH': 'Marshall Islands',
 'MK': 'Macedonia, The Former Yugoslav Republic of',
 'ML': 'Mali',
 'MM': 'Myanmar',
 'MN': 'Mongolia',
 'MO': 'Macao',
 'MP': 'Northern Mariana Islands',
 'MQ': 'Martinique',
 'MR': 'Mauritania',
 'MS': 'Montserrat',
 'MT': 'Malta',
 'MU': 'Mauritius',
 'MV': 'Maldives',
 'MW': 'Malawi',
 'MX': 'Mexico',
 'MY': 'Malaysia',
 'MZ': 'Mozambique',
 'NA': 'Namibia',
 'NC': 'New Caledonia',
 'NE': 'Niger',
 'NF': 'Norfolk Island',
 'NG': 'Nigeria',
 'NI': 'Nicaragua',
 'NL': 'Netherlands',
 'NO': 'Norway',
 'NP': 'Nepal',
 'NR': 'Nauru',
 'NU': 'Niue',
 'NZ': 'New Zealand',
 'OM': 'Oman',
 'PA': 'Panama',
 'PE': 'Peru',
 'PF': 'French Polynesia',
 'PG': 'Papua New Guinea',
 'PH': 'Philippines',
 'PK': 'Pakistan',
 'PL': 'Poland',
 'PM': 'Saint Pierre and Miquelon',
 'PN': 'Pitcairn',
 'PR': 'Puerto Rico',
 'PS': 'Palestinian Territory, Occupied',
 'PT': 'Portugal',
 'PW': 'Palau',
 'PY': 'Paraguay',
 'QA': 'Qatar',
 'RE': 'Réunion',
 'RO': 'Romania',
 'RS': 'Serbia',
 'RU': 'Russian Federation',
 'RW': 'Rwanda',
 'SA': 'Saudi Arabia',
 'SB': 'Solomon Islands',
 'SC': 'Seychelles',
 'SD': 'Sudan',
 'SE': 'Sweden',
 'SG': 'Singapore',
 'SH': 'Saint Helena',
 'SI': 'Slovenia',
 'SJ': 'Svalbard and Jan Mayen',
 'SK': 'Slovakia',
 'SL': 'Sierra Leone',
 'SM': 'San Marino',
 'SN': 'Senegal',
 'SO': 'Somalia',
 'SR': 'Suriname',
 'ST': 'Sao Tome and Principe',
 'SV': 'El Salvador',
 'SY': 'Syrian Arab Republic',
 'SZ': 'Swaziland',
 'TC': 'Turks and Caicos Islands',
 'TD': 'Chad',
 'TF': 'French Southern Territories',
 'TG': 'Togo',
 'TH': 'Thailand',
 'TJ': 'Tajikistan',
 'TK': 'Tokelau',
 'TL': 'Timor-Leste',
 'TM': 'Turkmenistan',
 'TN': 'Tunisia',
 'TO': 'Tonga',
 'TR': 'Turkey',
 'TT': 'Trinidad and Tobago',
 'TV': 'Tuvalu',
 'TW': 'Taiwan, Province of China',
 'TZ': 'Tanzania, United Republic of',
 'UA': 'Ukraine',
 'UG': 'Uganda',
 'UM': 'United States Minor Outlying Islands',
 'US': 'United States',
 'UY': 'Uruguay',
 'UZ': 'Uzbekistan',
 'VA': 'Holy See (Vatican City State)',
 'VC': 'Saint Vincent and the Grenadines',
 'VE': 'Venezuela, Bolivarian Republic of',
 'VG': 'Virgin Islands, British',
 'VI': 'Virgin Islands, U.S.',
 'VN': 'Viet Nam',
 'VU': 'Vanuatu',
 'WF': 'Wallis and Futuna',
 'WS': 'Samoa',
 'YE': 'Yemen',
 'YT': 'Mayotte',
 'ZA': 'South Africa',
 'ZM': 'Zambia',
 'ZW': 'Zimbabwe',
 }

# Reverse mappings, too, for convenience
state_codes = {v: k for k, v in state_names.items()}
country_codes = {v: k for k, v in country_names.items()}

In [66]:
#example using state code
state_names['NY']

'New York'

In [101]:
#using state name
state_codes['New York']

'NY'

In [68]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Here are some examples of how the fuzzy text matching function works.

In [69]:
process.extractOne("Minnesotta",choices=state_codes.keys())

('Minnesota', 95)

In [70]:
process.extractOne("AlaBAMMazzz",choices=state_codes.keys(),score_cutoff=80)
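The second call shows no output, which is consistent with `extractOne` returning `None` when no choice reaches `score_cutoff`. If fuzzywuzzy is not installed, the same idea can be sketched with the standard library's `difflib`. This is a different scoring algorithm (its scores run from 0 to 1 rather than 0 to 100), and `closest_state` and the short `choices` list are names invented for this example:

```python
import difflib

# A handful of state names; the full list would come from state_codes.keys().
choices = ["Minnesota", "Alabama", "Tennessee", "North Dakota"]

def closest_state(name, cutoff=0.8):
    """Return the best match at or above cutoff, or None if there is none."""
    matches = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest_state("Minnesotta"))   # close enough to a real state name
print(closest_state("AlaBAMMazzz"))  # nothing reaches the cutoff
```

Returning `None` for a poor match is what lets the calling code decide how to handle values it cannot clean up automatically.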

Add the column in the location we want and fill it with NaN values.

In [114]:
df.insert(6, "abbrev", np.nan)
df.head()

Unnamed: 0,account,name,street,city,state,postal-code,abbrev,Jan,Feb,Mar,total
0,211829,"Kerluke, Koepp and Hilpert",34456 Sean Highway,New Jaycob,Texas,28752,,10000,62000,35000,107000
1,320563,Walter-Trantow,1311 Alvis Tunnel,Port Khadijah,NorthCarolina,38365,,95000,45000,35000,175000
2,648336,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517,,91000,120000,35000,246000
3,109996,"D'Amore, Gleichner and Bode",155 Fadel Crescent Apt. 144,Hyattburgh,Maine,46021,,45000,120000,10000,175000
4,121213,Bauch-Goldner,7274 Marissa Common,Shanahanchester,California,49681,,162000,120000,35000,317000


We use `apply` to add the abbreviations into the appropriate column.

In [None]:
def convert_state(row):
    abbrev = process.extractOne(row["state"],choices=state_codes.keys(),score_cutoff=80)
    if abbrev:
        #print (abbrev)
        return state_codes[abbrev[0]]
    return np.nan

In [115]:
df['abbrev'] = df.apply(convert_state, axis=1)
df.tail()

Unnamed: 0,account,name,street,city,state,postal-code,abbrev,Jan,Feb,Mar,total
10,214098,"Goodwin, Homenick and Jerde",649 Cierra Forks Apt. 078,Rosaberg,Tenessee,47743,TN,45000,120000,55000,220000
11,231907,Hahn-Moore,18115 Olivine Throughway,Norbertomouth,NorthDakota,31415,ND,150000,10000,162000,322000
12,242368,"Frami, Anderson and Donnelly",182 Bertie Road,East Davian,Iowa,72686,IA,162000,120000,35000,317000
13,268755,Walsh-Haley,2624 Beatty Parkways,Goodwinmouth,RhodeIsland,31919,RI,55000,120000,35000,210000
14,273274,McDermott PLC,8917 Bergstrom Meadow,Kathryneborough,Delaware,27933,DE,150000,120000,70000,340000


We have developed a very simple process to intelligently clean up this data. Obviously when you only have 15 or so rows, this is not a big deal. However, what if you had 15,000? You would have to do something manual in Excel to clean this up.

## Subtotals
Next, let’s get some subtotals by state.

In Excel, we would use the `subtotal` tool to do this for us.

![image.png](https://raw.githubusercontent.com/fjvarasc/DSPXI/master/figures/excel-4.png)


The output would look like this:

![image.png](https://raw.githubusercontent.com/fjvarasc/DSPXI/master/figures/excel-5.png)

Creating a subtotal in pandas is accomplished using `groupby`.

In [116]:
df_sub=df[["abbrev","Jan","Feb","Mar","total"]].groupby('abbrev').sum()
df_sub

abbrev,Jan,Feb,Mar,total
AR,150000,120000,35000,305000
CA,162000,120000,35000,317000
DE,150000,120000,70000,340000
IA,253000,240000,70000,563000
ID,70000,120000,35000,225000
ME,45000,120000,10000,175000
MS,62000,120000,70000,252000
NC,95000,45000,35000,175000
ND,150000,10000,162000,322000
PA,70000,95000,35000,200000


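Next, we want to format the data as currency by applying a formatting function to every value in the DataFrame with `applymap` (pandas 2.1 and later offer the same behavior under the name `DataFrame.map`).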

In [117]:
def money(x):
    return "${:,.0f}".format(x)

formatted_df = df_sub.applymap(money)
formatted_df

abbrev,Jan,Feb,Mar,total
AR,"$150,000","$120,000","$35,000","$305,000"
CA,"$162,000","$120,000","$35,000","$317,000"
DE,"$150,000","$120,000","$70,000","$340,000"
IA,"$253,000","$240,000","$70,000","$563,000"
ID,"$70,000","$120,000","$35,000","$225,000"
ME,"$45,000","$120,000","$10,000","$175,000"
MS,"$62,000","$120,000","$70,000","$252,000"
NC,"$95,000","$45,000","$35,000","$175,000"
ND,"$150,000","$10,000","$162,000","$322,000"
PA,"$70,000","$95,000","$35,000","$200,000"
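Keep in mind that the formatted values are strings, so they can no longer be summed. If you would rather not depend on the element-wise method name, one version-neutral sketch is to apply `Series.map` to each column (the toy `subtotals` frame below is made up for illustration):

```python
import pandas as pd

def money(x):
    return "${:,.0f}".format(x)

subtotals = pd.DataFrame({"Jan": [150000, 162000]}, index=["AR", "CA"])

# Format column by column; every cell ends up as a string.
formatted = subtotals.apply(lambda col: col.map(money))
print(formatted)
```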


The formatting looks good. Now we can get the totals as we did earlier.

In [118]:
sum_row=df_sub[["Jan","Feb","Mar","total"]].sum()
sum_row

Jan      1462000
Feb      1507000
Mar       717000
total    3686000
dtype: int64

Convert the values to columns and format them.

In [119]:
df_sub_sum=pd.DataFrame(data=sum_row).T
df_sub_sum=df_sub_sum.applymap(money)
df_sub_sum

Unnamed: 0,Jan,Feb,Mar,total
0,"$1,462,000","$1,507,000","$717,000","$3,686,000"


Finally, add the total row to the DataFrame. `DataFrame.append` was removed in pandas 2.0, so we use `pd.concat`.

In [120]:
final_table = pd.concat([formatted_df, df_sub_sum])
final_table
final_table

Unnamed: 0,Jan,Feb,Mar,total
AR,"$150,000","$120,000","$35,000","$305,000"
CA,"$162,000","$120,000","$35,000","$317,000"
DE,"$150,000","$120,000","$70,000","$340,000"
IA,"$253,000","$240,000","$70,000","$563,000"
ID,"$70,000","$120,000","$35,000","$225,000"
ME,"$45,000","$120,000","$10,000","$175,000"
MS,"$62,000","$120,000","$70,000","$252,000"
NC,"$95,000","$45,000","$35,000","$175,000"
ND,"$150,000","$10,000","$162,000","$322,000"
PA,"$70,000","$95,000","$35,000","$200,000"


You’ll notice that the index is ‘0’ for the total line. We want to change that using `rename`.

In [121]:
final_table = final_table.rename(index={0:"Total"})
final_table

Unnamed: 0,Jan,Feb,Mar,total
AR,"$150,000","$120,000","$35,000","$305,000"
CA,"$162,000","$120,000","$35,000","$317,000"
DE,"$150,000","$120,000","$70,000","$340,000"
IA,"$253,000","$240,000","$70,000","$563,000"
ID,"$70,000","$120,000","$35,000","$225,000"
ME,"$45,000","$120,000","$10,000","$175,000"
MS,"$62,000","$120,000","$70,000","$252,000"
NC,"$95,000","$45,000","$35,000","$175,000"
ND,"$150,000","$10,000","$162,000","$322,000"
PA,"$70,000","$95,000","$35,000","$200,000"
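A shorter route to the same kind of table, offered here as a sketch rather than as the approach used above, is to add the total row while the values are still numeric and format everything last. Assigning to `.loc["Total"]` labels the row directly, so no `rename` is needed. The toy `subtotals` frame is made up for this example:

```python
import pandas as pd

subtotals = pd.DataFrame(
    {"Jan": [150000, 162000], "Feb": [120000, 120000]}, index=["AR", "CA"]
)

# Add the total row first, while the values are still numbers.
with_total = subtotals.copy()
with_total.loc["Total"] = subtotals.sum()

# Format as currency only at the very end.
final = with_total.apply(lambda col: col.map("${:,.0f}".format))
print(final)
```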


## Filtering the data

Import the pandas and numpy modules.

In [161]:
import pandas as pd
import numpy as np

Load in the Excel data that represents a year's worth of sales.

In [162]:
df = pd.read_excel("https://github.com/fjvarasc/DSPXI/blob/master/data/sample-salesv3.xlsx?raw=true")

Take a quick look at the data types to make sure everything came through as expected.

In [163]:
df.dtypes

account number      int64
name               object
sku                object
quantity            int64
unit price        float64
ext price         float64
date               object
dtype: object

You'll notice that our date column is showing up as a generic `object`. We are going to convert it to a datetime object to make some selections a little easier.

In [164]:
df['date'] = pd.to_datetime(df['date'])

In [165]:
df.head()

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
0,740150,Barton LLC,B1-20000,39,86.69,3380.91,2014-01-01 07:21:51
1,714466,Trantow-Barrows,S2-77896,-1,63.16,-63.16,2014-01-01 10:00:47
2,218895,Kulas Inc,B1-69924,23,90.7,2086.1,2014-01-01 13:24:58
3,307599,"Kassulke, Ondricka and Metz",S1-65481,41,21.05,863.05,2014-01-01 15:05:22
4,412290,Jerde-Hilpert,S2-34077,6,83.21,499.26,2014-01-01 23:26:55


In [6]:
df.dtypes

account number             int64
name                      object
sku                       object
quantity                   int64
unit price               float64
ext price                float64
date              datetime64[ns]
dtype: object

The date is now a datetime object which will be useful in future steps.

Similar to the autofilter function in Excel, you can use pandas to filter and select certain subsets of data.

For instance, if we want to just see a specific account number, we can easily do that with pandas.

Note, I am going to use the `head` function to show the top results. This is purely for the purposes of keeping the article shorter.

In [166]:
df[df["account number"]==307599].head()

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
3,307599,"Kassulke, Ondricka and Metz",S1-65481,41,21.05,863.05,2014-01-01 15:05:22
13,307599,"Kassulke, Ondricka and Metz",S2-10342,17,12.44,211.48,2014-01-04 07:53:01
34,307599,"Kassulke, Ondricka and Metz",S2-78676,35,33.04,1156.4,2014-01-10 05:26:31
58,307599,"Kassulke, Ondricka and Metz",B1-20000,22,37.87,833.14,2014-01-15 16:22:22
70,307599,"Kassulke, Ondricka and Metz",S2-10342,44,96.79,4258.76,2014-01-18 06:32:31


You could also do the filtering based on numeric values.

In [167]:
df[df["quantity"] > 22].head()

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
0,740150,Barton LLC,B1-20000,39,86.69,3380.91,2014-01-01 07:21:51
2,218895,Kulas Inc,B1-69924,23,90.7,2086.1,2014-01-01 13:24:58
3,307599,"Kassulke, Ondricka and Metz",S1-65481,41,21.05,863.05,2014-01-01 15:05:22
14,737550,"Fritsch, Russel and Anderson",B1-53102,23,71.56,1645.88,2014-01-04 08:57:48
15,239344,Stokes LLC,S1-06532,34,71.51,2431.34,2014-01-04 11:34:58


If we want to do more complex filtering, we can use `map` to build the filter. In this example, let's look for items whose SKUs start with B1.

In [168]:
df[df["sku"].map(lambda x: x.startswith('B1'))].head()

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
0,740150,Barton LLC,B1-20000,39,86.69,3380.91,2014-01-01 07:21:51
2,218895,Kulas Inc,B1-69924,23,90.7,2086.1,2014-01-01 13:24:58
6,218895,Kulas Inc,B1-65551,2,31.1,62.2,2014-01-02 10:57:23
14,737550,"Fritsch, Russel and Anderson",B1-53102,23,71.56,1645.88,2014-01-04 08:57:48
17,239344,Stokes LLC,B1-50809,14,16.23,227.22,2014-01-04 22:14:32


It's easy to chain two conditions together using the `&` operator. Wrap each comparison in parentheses, because `&` binds more tightly than `>` or `==`.

In [169]:
df[df["sku"].map(lambda x: x.startswith('B1')) & (df["quantity"] > 22)].head()

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
0,740150,Barton LLC,B1-20000,39,86.69,3380.91,2014-01-01 07:21:51
2,218895,Kulas Inc,B1-69924,23,90.7,2086.1,2014-01-01 13:24:58
14,737550,"Fritsch, Russel and Anderson",B1-53102,23,71.56,1645.88,2014-01-04 08:57:48
26,737550,"Fritsch, Russel and Anderson",B1-53636,42,42.06,1766.52,2014-01-08 00:02:11
31,714466,Trantow-Barrows,B1-33087,32,19.56,625.92,2014-01-09 10:16:32
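The same pattern extends to "or" and "not". As a rough sketch on a three-row toy frame (`orders`, `is_b1` and `is_large` are names chosen for this example), naming each condition first keeps the combined filter readable:

```python
import pandas as pd

orders = pd.DataFrame(
    {"sku": ["B1-20000", "S2-77896", "B1-65551"], "quantity": [39, -1, 2]}
)

# Each condition is a boolean Series with one True/False per row.
is_b1 = orders["sku"].str.startswith("B1")
is_large = orders["quantity"] > 22

both = orders[is_b1 & is_large]    # AND: B1 skus with a large quantity
either = orders[is_b1 | is_large]  # OR: B1 skus or a large quantity
not_b1 = orders[~is_b1]            # NOT: everything except B1 skus
print(both, either, not_b1, sep="\n\n")
```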


Another useful function that pandas supports is called `isin`. It allows us to define a list of values we want to look for.

In this case, we look for all records that include two specific account numbers.

In [170]:
df[df["account number"].isin([714466,218895])].head()

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
1,714466,Trantow-Barrows,S2-77896,-1,63.16,-63.16,2014-01-01 10:00:47
2,218895,Kulas Inc,B1-69924,23,90.7,2086.1,2014-01-01 13:24:58
5,714466,Trantow-Barrows,S2-77896,17,87.63,1489.71,2014-01-02 10:07:15
6,218895,Kulas Inc,B1-65551,2,31.1,62.2,2014-01-02 10:57:23
8,714466,Trantow-Barrows,S1-50961,22,84.09,1849.98,2014-01-03 11:29:02


Pandas supports another function called `query` which allows you to efficiently select subsets of data. It evaluates faster when [numexpr](https://github.com/pydata/numexpr) is installed; recent pandas versions fall back to a plain Python engine when it is not, so installing it is recommended rather than strictly required.

If you would like to get a list of customers by name, you can do that with a query, similar to the python syntax shown above.

In [171]:
df.query('name == ["Kulas Inc","Barton LLC"]').head()

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
0,740150,Barton LLC,B1-20000,39,86.69,3380.91,2014-01-01 07:21:51
2,218895,Kulas Inc,B1-69924,23,90.7,2086.1,2014-01-01 13:24:58
6,218895,Kulas Inc,B1-65551,2,31.1,62.2,2014-01-02 10:57:23
33,218895,Kulas Inc,S1-06532,3,22.36,67.08,2014-01-09 23:58:27
36,218895,Kulas Inc,S2-34077,16,73.04,1168.64,2014-01-10 12:07:30


The query function allows you to do more than just this simple example, but for the purposes of this discussion, I'm showing it so you are aware that it is out there for you.
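Two `query` features are worth a quick sketch: backticks let you refer to a column whose name contains a space, and `@` pulls in a Python variable. The toy `orders` frame and the `min_qty` threshold are made up for this example, and `engine="python"` is passed so that it runs without numexpr:

```python
import pandas as pd

orders = pd.DataFrame(
    {
        "account number": [740150, 714466, 218895],
        "name": ["Barton LLC", "Trantow-Barrows", "Kulas Inc"],
        "quantity": [39, -1, 23],
    }
)

min_qty = 20  # hypothetical threshold used only in this sketch

# Backticks quote the column name with a space; @ refers to the variable above.
result = orders.query(
    "`account number` in [740150, 218895] and quantity > @min_qty",
    engine="python",
)
print(result)
```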

## Working with Dates

In [None]:
import pandas as pd
import numpy as np

Load in the Excel data that represents a year's worth of sales.

In [None]:
df = pd.read_excel("https://github.com/fjvarasc/DSPXI/blob/master/data/sample-salesv3.xlsx?raw=true")
# Make sure the 'date' column read from the Excel file is a datetime field in the DataFrame
df['date'] = df['date'].astype('datetime64[ns]')

Using pandas, you can do complex filtering on dates. Before doing anything with dates, I encourage you to sort by the date column to make sure the results return what you are expecting.

In [None]:
df = df.sort_values(by=['date'])
df.head()


The python filtering syntax shown before works with dates.

In [None]:
df[df['date'] >='2014-09-05'].head()

One of the really nice features of pandas is that it understands dates, so it allows us to do partial filtering. If we want to look only for data from a specific month onward, we can do so.

In [None]:
df[df['date'] >='2014-03'].head()

Of course, you can chain the criteria.

In [None]:
df[(df['date'] >='20140702') & (df['date'] <= '2014-07-15')].head()
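One detail to watch with a range like this: when a date string is compared against a datetime column, it is treated as midnight at the start of that day, so rows later on the end date are left out. A small sketch with made-up timestamps (`orders`, `start` and `end` are illustrative names) makes this visible using `between`, which includes both endpoints by default:

```python
import pandas as pd

orders = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2014-06-30 09:00",
                "2014-07-02 10:30",
                "2014-07-15 00:00",
                "2014-07-15 08:00",
            ]
        ),
        "quantity": [5, 12, 7, 3],
    }
)

start = pd.Timestamp("2014-07-02")
end = pd.Timestamp("2014-07-15")  # midnight at the start of July 15

# Keeps rows with start <= date <= end.
in_range = orders[orders["date"].between(start, end)]
print(in_range)
```

The 08:00 order on July 15 is excluded. If you want the whole end day, compare with `<` against the following day instead.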

Because pandas understands date columns, you can express the date value in multiple formats and it will give you the results you expect.

In [None]:
df[df['date'] >= 'Oct-2014'].head()

In [None]:
df[df['date'] >= '102014'].head()

When working with time series data, if we convert the data to use the date as the index, we can do some more filtering.

Set the new index using `set_index`.

In [None]:
df2 = df.set_index(['date'])
df2.head()

We can slice the data to get a range.

In [None]:
df2["2014-01-01":"2014-02-01"].head()

Once again, we can use various date representations to remove any ambiguity around date naming conventions.

In [None]:
df2["2014-Jan-1":"2014-Feb-1"].head()

In [None]:
df2["2014-Jan-1":"2014-Feb-1"].tail()

In [None]:
df2.loc["2014"].head()  # pandas 2.0+ needs .loc to select rows by a partial date string

In [None]:
df2.loc["2014-Dec"].head()  # pandas 2.0+ needs .loc to select rows by a partial date string

## Additional String Functions

Pandas has support for vectorized string functions as well. If we want to identify all the skus that contain a certain value, we can use `str.contains`. In this case, we know that the sku is always represented in the same way, so B1 only shows up in the front of the sku.

In [None]:
df[df['sku'].str.contains('B1')].head()
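If the sku format were less regular, `str.contains` would also match B1 in the middle of a value. A short sketch of the difference, where "S1-B1000" is a made-up sku used only to show the edge case:

```python
import pandas as pd

skus = pd.Series(["B1-20000", "S2-77896", "S1-B1000"])

anywhere = skus.str.contains("B1")    # also matches "S1-B1000"
at_start = skus.str.startswith("B1")  # only skus that begin with B1
anchored = skus.str.contains(r"^B1")  # same result using a regex anchor
print(pd.DataFrame({"anywhere": anywhere, "at_start": at_start}))
```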

We can chain conditions together and use `sort_values` to control how the data is ordered.

In [None]:
df[(df['sku'].str.contains('B1-531')) & (df['quantity']>40)].sort_values(by=['quantity','name'],ascending=[False,True])

## Bonus Task

A common need in Excel is to get a list of all the unique items in a long column; for instance, maybe we only want to know which customers purchased in this time period. It is a multi-step process to do this in Excel but is fairly simple in pandas. We just use the [unique](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html) function on a column to get the list.

In [None]:
df["name"].unique()

If we wanted to include the account number, we could use `drop_duplicates`.

In [None]:
df.drop_duplicates(subset=["account number","name"]).head()

We are obviously pulling in more data than we need and getting some non-useful information, so select only the first and second columns using [`iloc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html).

In [None]:
df.drop_duplicates(subset=["account number","name"]).iloc[:,[0,1]]
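Selecting by position works, but it breaks if the column order ever changes. As a sketch of an alternative, assuming a toy frame with the same column names (`orders` and `customers` are illustrative names), you can select the two columns by name first and then drop the duplicates:

```python
import pandas as pd

orders = pd.DataFrame(
    {
        "account number": [740150, 714466, 740150],
        "name": ["Barton LLC", "Trantow-Barrows", "Barton LLC"],
        "quantity": [39, -1, 12],
    }
)

# Keep only the identifying columns, then one row per distinct pair.
customers = (
    orders[["account number", "name"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(customers)
```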

Now we encourage you to try applying these ideas to some of your own repetitive Excel tasks and streamline your workflow.