# Cute pandas as in Python package

Table of Contents

* [Loading Data](#Loading-Data)
* [Inspecting Data](#Inspecting-Data)
* [Cleaning Data](#Cleaning-Data)
* [Resources](#Resources)

In [131]:
# First we need to import the package we will be working with, in this case pandas.
# If you installed Python and Jupyter using the Anaconda distribution, pandas
# and many other science-related packages were included.

import pandas as pd

## Loading Data

In [132]:
# I have a CSV file saved in the same folder as this notebook.
# The pd.read_csv function is used to load the CSV file into a DataFrame.
# A DataFrame is just a fancy word for a table on steroids: a table is two-dimensional,
# and each of its columns is a Series, which is one-dimensional.

fin_sample = pd.read_csv('financial_sample.csv')
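
To make the DataFrame and Series relationship concrete, here is a minimal sketch that reads a small, made-up CSV from memory instead of `financial_sample.csv`. The column names and values in `csv_text` are invented for illustration only.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for a real file on disk
csv_text = "Segment,Country,Units Sold\nGovernment,Canada,1618.5\nMidmarket,France,2178.0\n"

# read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(type(toy))             # the whole table is a DataFrame
print(type(toy["Country"]))  # a single column is a Series
print(toy.shape)             # (number of rows, number of columns)
```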

## Inspecting Data

In [133]:
# Let's take a look at the first five rows of our brand new DataFrame, and then the last five.
fin_sample.head()

Unnamed: 0,Segment,Country,Product,Discount Band,Units Sold,Manufacturing Price,Sale Price,Gross Sales,Discounts,Sales,COGS,Profit,Date,Month Number,Month Name,Year
0,Government,Canada,Carretera,,1618.5,$3.00,$20.00,"$32,370.00",$-,"$32,370.00","$16,185.00","$16,185.00",1/1/14,1,January,2014
1,Government,Germany,Carretera,,1321.0,$3.00,$20.00,"$26,420.00",$-,"$26,420.00","$13,210.00","$13,210.00",1/1/14,1,January,2014
2,Midmarket,France,Carretera,,2178.0,$3.00,$15.00,"$32,670.00",$-,"$32,670.00","$21,780.00","$10,890.00",6/1/14,6,June,2014
3,Midmarket,Germany,Carretera,,888.0,$3.00,$15.00,"$13,320.00",$-,"$13,320.00","$8,880.00","$4,440.00",6/1/14,6,June,2014
4,Midmarket,Mexico,Carretera,,2470.0,$3.00,$15.00,"$37,050.00",$-,"$37,050.00","$24,700.00","$12,350.00",6/1/14,6,June,2014


In [134]:
fin_sample.tail()

Unnamed: 0,Segment,Country,Product,Discount Band,Units Sold,Manufacturing Price,Sale Price,Gross Sales,Discounts,Sales,COGS,Profit,Date,Month Number,Month Name,Year
20995,Small Business,France,Amarilla,High,2475.0,$260.00,$300.00,"$742,500.00","$111,375.00","$631,125.00","$618,750.00","$12,375.00",3/1/2014,3,March,2014
20996,Small Business,Mexico,Amarilla,High,546.0,$260.00,$300.00,"$163,800.00","$24,570.00","$139,230.00","$136,500.00","$2,730.00",10/1/2014,10,October,2014
20997,Government,Mexico,Montana,High,1368.0,$5.00,$7.00,"$9,576.00","$1,436.40","$8,139.60","$6,840.00","$1,299.60",2/1/2014,2,February,2014
20998,Government,Canada,Paseo,High,723.0,$10.00,$7.00,"$5,061.00",$759.15,"$4,301.85","$3,615.00",$686.85,4/1/2014,4,April,2014
20999,Channel Partners,United States of America,VTT,High,1806.0,$250.00,$12.00,"$21,672.00","$3,250.80","$18,421.20","$5,418.00","$13,003.20",5/1/2014,5,May,2014


## Cleaning Data

In [135]:
# The first thing we notice is that a few columns contain dollar signs and placeholder values such as '$-'.
# We need to clean those up and check whether there are trailing spaces.
# Let's first look at the Discounts column, since that is where the '$-' values are.
# To look at a specific column, use data_frame['Column'].

In [137]:
fin_sample['Discounts']

KeyError: 'Discounts'

In [148]:
# Oops, what happened here?
# Let's look at all the column names to see whether we are using the correct one.

In [149]:
fin_sample.columns

Index(['Segment', 'Country', 'Product', 'Discount Band', 'Units Sold',
       'Manufacturing Price', 'Sale Price', 'Gross Sales', 'Discounts',
       'Sales', 'COGS', 'Profit', 'Date', 'Month Number', 'Month Name',
       'Year'],
      dtype='object')

In [150]:
# And there they are, pesky leading and trailing spaces all over the place.
# The code above, fin_sample['Discounts'], failed because
# our column name is actually ' Discounts ', not 'Discounts', according to the output above.
# We need to strip all the spaces from the column names before proceeding.
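
Padding in column names is easy to miss by eye. As a small sketch on a made-up frame (the `toy` DataFrame below is hypothetical, not the financial data), one way to list only the names that carry extra whitespace is to compare each name with its stripped version:

```python
import pandas as pd

# Hypothetical frame with one padded column name, mimicking the problem above
toy = pd.DataFrame({" Discounts ": ["$-"], "Segment": ["Government"]})

# tolist() shows each name in quotes, so stray spaces become visible
print(toy.columns.tolist())

# A name that changes when stripped has leading or trailing whitespace
padded = [name for name in toy.columns if name != name.strip()]
print(padded)
```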

In [151]:
# The line below renames the columns of our DataFrame in place (without creating a new DataFrame).
# It does so by applying an anonymous Python function (a lambda) to each x, where each x is a column name.
# We are applying the .strip() method to each column name, or x.

fin_sample.rename(columns=lambda x: x.strip(), inplace=True)
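
For comparison, here is a sketch of two equivalent ways to strip column names, shown on a made-up frame (`toy` and its values are hypothetical). The first mirrors the `rename` call above but returns a new DataFrame; the second uses the vectorized `.str.strip()` available on the column Index.

```python
import pandas as pd

# Hypothetical frame with padded column names
toy = pd.DataFrame({" Discounts ": ["$-"], " Sales ": ["$32,370.00"]})

# Option 1: rename with a function; without inplace=True this returns a new DataFrame
renamed = toy.rename(columns=lambda name: name.strip())

# Option 2: assign the stripped Index back to the columns attribute
toy.columns = toy.columns.str.strip()

print(renamed.columns.tolist())
print(toy.columns.tolist())
```

Which one to use is mostly a matter of taste; the Index version avoids the lambda.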

In [152]:
# Now let's look at those columns.
fin_sample.columns

Index(['Segment', 'Country', 'Product', 'Discount Band', 'Units Sold',
       'Manufacturing Price', 'Sale Price', 'Gross Sales', 'Discounts',
       'Sales', 'COGS', 'Profit', 'Date', 'Month Number', 'Month Name',
       'Year'],
      dtype='object')

In [153]:
# Perfect, it worked!
fin_sample['Discounts']

0                $-   
1                $-   
2                $-   
3                $-   
4                $-   
             ...      
20995     $111,375.00 
20996      $24,570.00 
20997       $1,436.40 
20998         $759.15 
20999       $3,250.80 
Name: Discounts, Length: 21000, dtype: object

In [154]:
# We have 21,000 rows of data! Yowza!

In [156]:
# Now these '$-' values need to be stripped as well.
# We will create a new DataFrame by applying another .strip() call to every text column.

In [160]:
fin_sample_clean = fin_sample.apply(lambda x: x.str.strip('$-') if x.dtype == "object" else x)

In [161]:
# It didn't work! The '$-' values are still there in the Discounts column!
fin_sample_clean.head()

Unnamed: 0,Segment,Country,Product,Discount Band,Units Sold,Manufacturing Price,Sale Price,Gross Sales,Discounts,Sales,COGS,Profit,Date,Month Number,Month Name,Year
0,Government,Canada,Carretera,,1618.5,$3.00,$20.00,"$32,370.00",$-,"$32,370.00","$16,185.00","$16,185.00",1/1/14,1,January,2014
1,Government,Germany,Carretera,,1321.0,$3.00,$20.00,"$26,420.00",$-,"$26,420.00","$13,210.00","$13,210.00",1/1/14,1,January,2014
2,Midmarket,France,Carretera,,2178.0,$3.00,$15.00,"$32,670.00",$-,"$32,670.00","$21,780.00","$10,890.00",6/1/14,6,June,2014
3,Midmarket,Germany,Carretera,,888.0,$3.00,$15.00,"$13,320.00",$-,"$13,320.00","$8,880.00","$4,440.00",6/1/14,6,June,2014
4,Midmarket,Mexico,Carretera,,2470.0,$3.00,$15.00,"$37,050.00",$-,"$37,050.00","$24,700.00","$12,350.00",6/1/14,6,June,2014


In [163]:
# Let's look at a couple of those data points in the Discounts column
[x for x in fin_sample['Discounts']][:2]

[' $-   ', ' $-   ']

In [164]:
# Yup! Spaces again... Let's blast them out first!
fin_sample_trimmed = fin_sample.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

In [166]:
# Spaces are gone!
[x for x in fin_sample_trimmed['Discounts']][:2]

['$-', '$-']
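
Why did the first attempt fail? `strip('$-')` only removes the listed characters from the two ends of a string, and here both ends were spaces. The sketch below shows this on plain Python strings and on a small made-up Series (`col` is hypothetical, not the real Discounts column).

```python
import pandas as pd

raw = " $-   "

# Both ends of the string are spaces, so strip('$-') removes nothing
print(repr(raw.strip("$-")))

# Removing whitespace first exposes '$-' at the ends
trimmed = raw.strip()
print(repr(trimmed))
print(repr(trimmed.strip("$-")))

# Characters in the middle of a string are never touched by strip
print(repr("$1,436.40".strip("$-")))

# In pandas the two steps can be chained on a text column
col = pd.Series([" $-   ", " $759.15 "])
cleaned = col.str.strip().str.strip("$-")
print(cleaned.tolist())
```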

In [169]:
# Now let's get rid of '$-' for real
fin_sample_trimmed_clean = fin_sample_trimmed.apply(lambda x: x.str.strip('$-') if x.dtype == "object" else x)

In [170]:
# Gone!
fin_sample_trimmed_clean.head()

Unnamed: 0,Segment,Country,Product,Discount Band,Units Sold,Manufacturing Price,Sale Price,Gross Sales,Discounts,Sales,COGS,Profit,Date,Month Number,Month Name,Year
0,Government,Canada,Carretera,,1618.5,3.0,20.0,32370.0,,32370.0,16185.0,16185.0,1/1/14,1,January,2014
1,Government,Germany,Carretera,,1321.0,3.0,20.0,26420.0,,26420.0,13210.0,13210.0,1/1/14,1,January,2014
2,Midmarket,France,Carretera,,2178.0,3.0,15.0,32670.0,,32670.0,21780.0,10890.0,6/1/14,6,June,2014
3,Midmarket,Germany,Carretera,,888.0,3.0,15.0,13320.0,,13320.0,8880.0,4440.0,6/1/14,6,June,2014
4,Midmarket,Mexico,Carretera,,2470.0,3.0,15.0,37050.0,,37050.0,24700.0,12350.0,6/1/14,6,June,2014


In [181]:
# We will have to fix that as well: the empty strings need to become zeros.
# A whole column needs to hold one uniform type of data, all numbers or all strings, etc.

[x for x in fin_sample_trimmed_clean['Discounts']][:2]

['', '']

In [182]:
fin_sample_trimmed_clean.tail()

Unnamed: 0,Segment,Country,Product,Discount Band,Units Sold,Manufacturing Price,Sale Price,Gross Sales,Discounts,Sales,COGS,Profit,Date,Month Number,Month Name,Year
20995,Small Business,France,Amarilla,High,2475.0,260.0,300.0,742500.0,111375.0,631125.0,618750.0,12375.0,3/1/2014,3,March,2014
20996,Small Business,Mexico,Amarilla,High,546.0,260.0,300.0,163800.0,24570.0,139230.0,136500.0,2730.0,10/1/2014,10,October,2014
20997,Government,Mexico,Montana,High,1368.0,5.0,7.0,9576.0,1436.4,8139.6,6840.0,1299.6,2/1/2014,2,February,2014
20998,Government,Canada,Paseo,High,723.0,10.0,7.0,5061.0,759.15,4301.85,3615.0,686.85,4/1/2014,4,April,2014
20999,Channel Partners,United States of America,VTT,High,1806.0,250.0,12.0,21672.0,3250.8,18421.2,5418.0,13003.2,5/1/2014,5,May,2014


In [183]:
# In the next notebook we will look at converting the empty strings into valid data points and
# verifying that we got rid of all the spaces. Cleaning data is 90% of the work!
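
As a preview, here is one possible sketch of that conversion, not necessarily the approach the next notebook takes. It assumes an empty string should count as zero and that commas are thousands separators; the `discounts` Series is made up to resemble the cleaned Discounts column.

```python
import pandas as pd

# Hypothetical values shaped like the cleaned Discounts column
discounts = pd.Series(["", "111,375.00", "759.15"])

# Drop the thousands separators, then turn whole-value empty strings into "0"
as_text = discounts.str.replace(",", "", regex=False).replace("", "0")

# Convert the cleaned text to numbers
numeric = pd.to_numeric(as_text)

print(numeric.tolist())
```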

## Resources

 * [Getting started with pandas in 10 min](https://pandas.pydata.org/docs/getting_started/10min.html)
 * [pandas Cookbook](https://pandas.pydata.org/docs/user_guide/cookbook.html#cookbook)
 
 
 * [Scroll to Top](#Cute-pandas-as-in-Python-package)