# Cute pandas 2

Table of Contents

* [Where we left off last](#Where-we-left-off-last)
* [Dtypes](#Dtypes)
* [Cleaning Data](#Cleaning-Data)
* [Resources](#Resources)

To run the code cells below, either click `Run` in the menu above or use a keyboard shortcut (`Help` has a list of all the `Keyboard Shortcuts`):
* `Shift + Enter` run the current cell, select below
* `Ctrl + Enter` run selected cells
* `Alt + Enter` run the current cell, insert below
* `Ctrl + S` save and checkpoint


## Where we left off last

Here is where we left off at the end of the `python_pandas_cleaning_data1.ipynb` file.
In this file, we take it a step further: we check the data in every column,
convert it to an appropriate format, and get it ready for analysis.

In [79]:
# Importing pandas package
import pandas as pd

In [80]:
#Loading csv file with accounting data
fin_sample = pd.read_csv('financial_sample.csv')

# 'Renaming' columns in place by stripping away spaces before and after column names in the existing dataframe
fin_sample.rename(columns=lambda x: x.strip(), inplace=True)

# Stripping spaces around text in dataframe
fin_sample_trimmed = fin_sample.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

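# Stripping leading and trailing '$' and '-' characters from text values
# (str.strip treats '$-' as a set of characters to remove, not as a literal prefix)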
fin_sample_trimmed_clean = fin_sample_trimmed.apply(lambda x: x.str.strip('$-') if x.dtype == "object" else x)
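
To make the three cleaning steps above easier to follow, here is a minimal sketch on a made-up two-row frame. The sample values and the names `raw`, `trimmed` and `clean` are assumptions for illustration only; they are not taken from `financial_sample.csv`. The text columns are built with `dtype=object` so that the `x.dtype == "object"` check behaves as it does in the cell above.

```python
import pandas as pd

# Hypothetical miniature of the raw file: padded headers and padded, symbol-laden values
raw = pd.DataFrame({
    " Segment ": pd.Series([" Government ", "Midmarket "], dtype=object),
    " Discounts ": pd.Series([" $-   ", " $276.15 "], dtype=object),
    "Year": [2014, 2014],
})

# Step 1: strip whitespace from the column names
raw = raw.rename(columns=lambda name: name.strip())

# Step 2: strip whitespace from text columns only; numeric columns pass through unchanged
trimmed = raw.apply(lambda col: col.str.strip() if col.dtype == "object" else col)

# Step 3: remove any leading/trailing '$' or '-' characters from text columns
clean = trimmed.apply(lambda col: col.str.strip("$-") if col.dtype == "object" else col)

print(list(clean.columns))          # ['Segment', 'Discounts', 'Year']
print(clean["Discounts"].tolist())  # ['', '276.15']
```

Note that a cell containing only `$-` ends up as an empty string, which is why such values still need handling before they can be treated as numbers.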

In [81]:
# At this point, we are exactly where we left off last...
fin_sample_trimmed_clean.head()

| | Segment | Country | Product | Discount Band | Units Sold | Manufacturing Price | Sale Price | Gross Sales | Discounts | Sales | COGS | Profit | Date | Month Number | Month Name | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Government | Canada | Carretera | | 1618.5 | 3.0 | 20.0 | 32370.0 | | 32370.0 | 16185.0 | 16185.0 | 1/1/14 | 1 | January | 2014 |
| 1 | Government | Germany | Carretera | | 1321.0 | 3.0 | 20.0 | 26420.0 | | 26420.0 | 13210.0 | 13210.0 | 1/1/14 | 1 | January | 2014 |
| 2 | Midmarket | France | Carretera | | 2178.0 | 3.0 | 15.0 | 32670.0 | | 32670.0 | 21780.0 | 10890.0 | 6/1/14 | 6 | June | 2014 |
| 3 | Midmarket | Germany | Carretera | | 888.0 | 3.0 | 15.0 | 13320.0 | | 13320.0 | 8880.0 | 4440.0 | 6/1/14 | 6 | June | 2014 |
| 4 | Midmarket | Mexico | Carretera | | 2470.0 | 3.0 | 15.0 | 37050.0 | | 37050.0 | 24700.0 | 12350.0 | 6/1/14 | 6 | June | 2014 |


In [82]:
fin_sample_trimmed_clean.columns

Index(['Segment', 'Country', 'Product', 'Discount Band', 'Units Sold',
       'Manufacturing Price', 'Sale Price', 'Gross Sales', 'Discounts',
       'Sales', 'COGS', 'Profit', 'Date', 'Month Number', 'Month Name',
       'Year'],
      dtype='object')

## Dtypes

In [83]:
# What data types are we dealing with here?
# Each column has its data type assigned/inferred when the csv data is loaded.
# The 'object' data type is a catch-all: it usually means text, but the column can hold mixed types.
# For example, the 'Country' column could hold strings (text) and integers (numbers) side by side, which is not great.
# We want each column to have a uniform data type so that we know how it behaves when we manipulate it
# during analysis. The output below shows us that 'Units Sold' is a float (decimal), and 'Month Number' and 'Year' are integers.
# Every other column is an object, which we need to fix.

fin_sample_trimmed_clean.dtypes

Segment                 object
Country                 object
Product                 object
Discount Band           object
Units Sold             float64
Manufacturing Price     object
Sale Price              object
Gross Sales             object
Discounts               object
Sales                   object
COGS                    object
Profit                  object
Date                    object
Month Number             int64
Month Name              object
Year                     int64
dtype: object
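
As a small illustration of why an `object` column is hard to rely on, the sketch below builds a made-up column that mixes text and a number, then counts the Python types stored inside it. The series `mixed` and its values are hypothetical and only serve to show the idea.

```python
import pandas as pd

# Hypothetical column mixing text and a number, stored with the object dtype
mixed = pd.Series(["Canada", "Germany", 42], dtype=object)

# pandas reports a single dtype for the whole column...
print(mixed.dtype)  # object

# ...but the individual values can still be different Python types
type_counts = mixed.map(lambda value: type(value).__name__).value_counts().to_dict()
print(type_counts)  # {'str': 2, 'int': 1}
```

Counting types this way is a quick check to run on any `object` column before deciding how to convert it.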

In [84]:
# Let's look at the first two data points in the 'Sales' column. Both look like numbers... but are they?
[x for x in fin_sample_trimmed_clean['Sales']][0:2]

['32,370.00', '26,420.00']

In [85]:
# We have strings on our hands... Strings are text data, not numbers, so we cannot do arithmetic on them yet.
[type(x) for x in fin_sample_trimmed_clean['Sales']][0:2]

[str, str]
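
The sketch below shows why this matters, using the two `'Sales'` strings printed above in a stand-alone series. The names `sales_text` and `sales_numbers` are assumptions for illustration. Removing the commas and casting to `float` is one possible way to convert such values, not necessarily the approach used later in this notebook.

```python
import pandas as pd

# Stand-alone copy of the two 'Sales' strings shown above
sales_text = pd.Series(["32,370.00", "26,420.00"], dtype=object)

# Adding two strings joins them instead of summing them
joined = sales_text.iloc[0] + sales_text.iloc[1]
print(joined)  # 32,370.0026,420.00

# Remove the thousands separators, then convert to float
sales_numbers = sales_text.str.replace(",", "", regex=False).astype(float)
print(sales_numbers.sum())  # 58790.0
```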

In [31]:
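# Converting 'Discounts' to numbers: remove thousands separators, then parse.
# errors='coerce' turns values that cannot be parsed into NaN instead of raising an error.
fin_sample_trimmed_clean['Discounts'] = pd.to_numeric(fin_sample_trimmed_clean['Discounts'].astype(str).str.replace(',',''), errors='coerce')
# Replacing missing values with 0 (note: this fills NaN in every column, not only 'Discounts')
fin_sample_trimmed_clean.fillna(0, inplace=True)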
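
Here is a minimal sketch of what `errors='coerce'` followed by `fillna(0)` does, run on a made-up series rather than the real column. The values (including the `"n/a"` entry) and the names `discounts`, `as_numbers` and `filled` are hypothetical.

```python
import pandas as pd

# Hypothetical 'Discounts'-like values: an empty string, plain numbers, and unparseable text
discounts = pd.Series(["", "276.15", "1,482.00", "n/a"], dtype=object)

# Remove thousands separators, then parse; anything unparseable becomes NaN
as_numbers = pd.to_numeric(discounts.str.replace(",", "", regex=False), errors="coerce")
missing = int(as_numbers.isna().sum())
print(missing)  # 2

# Replace the NaN values with 0 in this one series only
filled = as_numbers.fillna(0)
print(filled.tolist())  # [0.0, 276.15, 1482.0, 0.0]
```

Filling a single series keeps the change scoped to that column, whereas calling `fillna(0)` on the whole dataframe replaces missing values everywhere. Which of the two is appropriate depends on whether 0 is a sensible default for the other columns.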


In [32]:
fin_sample_trimmed_clean.tail()

| | Segment | Country | Product | Discount Band | Units Sold | Manufacturing Price | Sale Price | Gross Sales | Discounts | Sales | COGS | Profit | Date | Month Number | Month Name | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20995 | Small Business | France | Amarilla | High | 2475.0 | 260.0 | 300.0 | 742500.0 | 111375.0 | 631125.0 | 618750.0 | 12375.0 | 3/1/2014 | 3 | March | 2014 |
| 20996 | Small Business | Mexico | Amarilla | High | 546.0 | 260.0 | 300.0 | 163800.0 | 24570.0 | 139230.0 | 136500.0 | 2730.0 | 10/1/2014 | 10 | October | 2014 |
| 20997 | Government | Mexico | Montana | High | 1368.0 | 5.0 | 7.0 | 9576.0 | 1436.4 | 8139.6 | 6840.0 | 1299.6 | 2/1/2014 | 2 | February | 2014 |
| 20998 | Government | Canada | Paseo | High | 723.0 | 10.0 | 7.0 | 5061.0 | 759.15 | 4301.85 | 3615.0 | 686.85 | 4/1/2014 | 4 | April | 2014 |
| 20999 | Channel Partners | United States of America | VTT | High | 1806.0 | 250.0 | 12.0 | 21672.0 | 3250.8 | 18421.2 | 5418.0 | 13003.2 | 5/1/2014 | 5 | May | 2014 |


In [33]:
# Listing every value in 'Discounts': they are now floats, with 0.0 in place of the missing values
[x for x in fin_sample_trimmed_clean['Discounts']]

[0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 276.15,
 344.4,
 72.1,
 44.73,
 92.82,
 222.96,
 4235.0,
 177.03,
 173.4,
 412.5,
 320.52,
 91.92,
 1482.0,
 4889.5,
 7542.5,
 332.1,
 6903.0,
 275.1,
 128.1,
 7494.0,
 828.75,
 227.1,
 314.48,
 908.75,
 983.75,
 2278.75,
 112.05,
 91.92,
 8715.0,
 7542.5,
 772.8,
 25.34,
 1153.75,
 828.75,
 146.44,
 18.41,
 3302.25,
 908.75,
 983.75,
 2958.0,
 1482.0,
 4889.5,
 2180.0,
 238.68,
 48.15,
 1856.25,
 310.8,
 1284.0,
 300.3,
 19964.0,
 274.08,
 626.4,
 165.6,
 4150.0,
 708.9,
 5508.0,
 10368.0,
 274.08,
 1655.0,
 310.8,
 2022.5,
 5362.5,
 428.4,
 11496.0,
 19964.0,
 6822.5,
 577.5,
 281.82,
 253.2,
 260.16,
 626.4,
 20762.0,
 20139.0,
 2022.5,
 5362.5,
 253.2,
 217.6,
 260.16,

In [34]:
# 'Sales' has not been converted yet, so its values are still strings
[type(x) for x in fin_sample_trimmed_clean['Sales']][0:2]

[str, str]

In [27]:
# Summary of the dataframe: number of rows, plus the non-null count and dtype of each column
fin_sample_trimmed_clean.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21000 entries, 0 to 20999
Data columns (total 16 columns):
Segment                21000 non-null object
Country                21000 non-null object
Product                21000 non-null object
Discount Band          21000 non-null object
Units Sold             21000 non-null float64
Manufacturing Price    21000 non-null object
Sale Price             21000 non-null object
Gross Sales            21000 non-null object
Discounts              19410 non-null float64
Sales                  21000 non-null object
COGS                   21000 non-null object
Profit                 21000 non-null object
Date                   21000 non-null object
Month Number           21000 non-null int64
Month Name             21000 non-null object
Year                   21000 non-null int64
dtypes: float64(2), int64(2), object(12)
memory usage: 2.6+ MB
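
The summary above still lists twelve `object` columns. One way the remaining conversions might look is sketched below on a made-up two-row frame. The frame `df`, the list `money_columns`, and the assumption that dates are written month first (`%m/%d/%Y`) are all illustrative; the real file may need different columns or a different date format.

```python
import pandas as pd

# Hypothetical two-row slice with the same kinds of text columns as the data above
df = pd.DataFrame({
    "Sales": pd.Series(["631,125.00", "139,230.00"], dtype=object),
    "Profit": pd.Series(["12,375.00", "2,730.00"], dtype=object),
    "Date": pd.Series(["3/1/2014", "10/1/2014"], dtype=object),
})

# Convert each money-like text column to float, one column at a time
money_columns = ["Sales", "Profit"]
for name in money_columns:
    df[name] = pd.to_numeric(df[name].str.replace(",", "", regex=False), errors="coerce")

# Parse the date strings, assuming a month/day/year layout
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

print(df.dtypes)
```

After a step like this, `df.dtypes` or `df.info()` can be rerun to confirm that no unexpected `object` columns remain.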


In [28]:
# Checking the Python type of every value in 'Discounts'
fin_sample_trimmed_clean['Discounts'].apply(type)

0        <class 'float'>
1        <class 'float'>
2        <class 'float'>
3        <class 'float'>
4        <class 'float'>
              ...       
20995    <class 'float'>
20996    <class 'float'>
20997    <class 'float'>
20998    <class 'float'>
20999    <class 'float'>
Name: Discounts, Length: 21000, dtype: object

In [29]:
fin_sample_trimmed_clean.tail()

| | Segment | Country | Product | Discount Band | Units Sold | Manufacturing Price | Sale Price | Gross Sales | Discounts | Sales | COGS | Profit | Date | Month Number | Month Name | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20995 | Small Business | France | Amarilla | High | 2475.0 | 260.0 | 300.0 | 742500.0 | 111375.0 | 631125.0 | 618750.0 | 12375.0 | 3/1/2014 | 3 | March | 2014 |
| 20996 | Small Business | Mexico | Amarilla | High | 546.0 | 260.0 | 300.0 | 163800.0 | 24570.0 | 139230.0 | 136500.0 | 2730.0 | 10/1/2014 | 10 | October | 2014 |
| 20997 | Government | Mexico | Montana | High | 1368.0 | 5.0 | 7.0 | 9576.0 | 1436.4 | 8139.6 | 6840.0 | 1299.6 | 2/1/2014 | 2 | February | 2014 |
| 20998 | Government | Canada | Paseo | High | 723.0 | 10.0 | 7.0 | 5061.0 | 759.15 | 4301.85 | 3615.0 | 686.85 | 4/1/2014 | 4 | April | 2014 |
| 20999 | Channel Partners | United States of America | VTT | High | 1806.0 | 250.0 | 12.0 | 21672.0 | 3250.8 | 18421.2 | 5418.0 | 13003.2 | 5/1/2014 | 5 | May | 2014 |


## Resources

 * [Scroll to Top](#Cute-pandas-2)