# Data Workflow Lab 1

Clean and summarize Project 3 data.

### Learning Objectives

* Practice text cleaning techniques
* Practice datatype conversion
* Practice filling in missing values with either 0 or the average in the column
* Practice categorical data techniques
* Transform data into usable quantities
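
The cells in this lab drop rows with missing values rather than filling them, so here is a minimal sketch of the fill-with-0 and fill-with-average techniques named in the objectives. The `toy` frame and its values are hypothetical; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror the lab's dataset.
toy = pd.DataFrame({
    "Bottles Sold": [12.0, np.nan, 24.0],
    "State Bottle Cost": [4.50, 13.75, np.nan],
})

# Fill a count-like column with 0.
toy["Bottles Sold"] = toy["Bottles Sold"].fillna(0)

# Fill a price-like column with the column average (mean() skips NaN).
toy["State Bottle Cost"] = toy["State Bottle Cost"].fillna(
    toy["State Bottle Cost"].mean()
)
```

Which fill is appropriate depends on the column: 0 makes sense for counts, while an average is usually a safer guess for prices.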


In [1]:
%matplotlib inline
import datetime
import numpy as np
import pandas as pd

**Load the data**

In [2]:
df = pd.read_csv("CSV/Iowa_Liquor_sales_sample_10pct.csv")
print(df.columns)
df.head()

Index([u'Date', u'Store Number', u'City', u'Zip Code', u'County Number',
       u'County', u'Category', u'Category Name', u'Vendor Number',
       u'Item Number', u'Item Description', u'Bottle Volume (ml)',
       u'State Bottle Cost', u'State Bottle Retail', u'Bottles Sold',
       u'Sale (Dollars)', u'Volume Sold (Liters)', u'Volume Sold (Gallons)'],
      dtype='object')


Unnamed: 0,Date,Store Number,City,Zip Code,County Number,County,Category,Category Name,Vendor Number,Item Number,Item Description,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),Volume Sold (Gallons)
0,11/04/2015,3717,SUMNER,50674,9.0,Bremer,1051100.0,APRICOT BRANDIES,55,54436,Mr. Boston Apricot Brandy,750,$4.50,$6.75,12,$81.00,9.0,2.38
1,03/02/2016,2614,DAVENPORT,52807,82.0,Scott,1011100.0,BLENDED WHISKIES,395,27605,Tin Cup,750,$13.75,$20.63,2,$41.26,1.5,0.4
2,02/11/2016,2106,CEDAR FALLS,50613,7.0,Black Hawk,1011200.0,STRAIGHT BOURBON WHISKIES,65,19067,Jim Beam,1000,$12.59,$18.89,24,$453.36,24.0,6.34
3,02/03/2016,2501,AMES,50010,85.0,Story,1071100.0,AMERICAN COCKTAILS,395,59154,1800 Ultimate Margarita,1750,$9.50,$14.25,6,$85.50,10.5,2.77
4,08/18/2015,3654,BELMOND,50421,99.0,Wright,1031080.0,VODKA 80 PROOF,297,35918,Five O'clock Vodka,1750,$7.20,$10.80,12,$129.60,21.0,5.55


## Clean the data

Let's practice our data cleaning skills on the Project 3 dataset. If you don't remember how to do any of these tasks, look back at your work from the previous weeks or search the internet. There are many blog articles and Stack Overflow posts that cover these topics.

You'll want to complete at least the following tasks:
* Remove redundant columns
* Remove "$" characters from prices and convert values to floats.
* Convert dates to pandas datetime objects
* Convert category floats to integers
* Drop or fill in bad values

**Remove redundant columns**

In [3]:
df.drop(['County', 'Category Name', 'Vendor Number', 'Item Description', 'Volume Sold (Gallons)'], axis=1, inplace=True)

In [4]:
df.head()

Unnamed: 0,Date,Store Number,City,Zip Code,County Number,Category,Item Number,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters)
0,11/04/2015,3717,SUMNER,50674,9.0,1051100.0,54436,750,$4.50,$6.75,12,$81.00,9.0
1,03/02/2016,2614,DAVENPORT,52807,82.0,1011100.0,27605,750,$13.75,$20.63,2,$41.26,1.5
2,02/11/2016,2106,CEDAR FALLS,50613,7.0,1011200.0,19067,1000,$12.59,$18.89,24,$453.36,24.0
3,02/03/2016,2501,AMES,50010,85.0,1071100.0,59154,1750,$9.50,$14.25,6,$85.50,10.5
4,08/18/2015,3654,BELMOND,50421,99.0,1031080.0,35918,1750,$7.20,$10.80,12,$129.60,21.0


**Remove $ from certain columns**

In [5]:
# Pull out the numeric part of each price and convert it to float.
# expand=False makes str.extract return a Series rather than a DataFrame.
for col in ['State Bottle Cost', 'State Bottle Retail', 'Sale (Dollars)']:
    df[col] = df[col].str.extract(r'([0-9.]+)', expand=False).astype(float)
df.head()



Unnamed: 0,Date,Store Number,City,Zip Code,County Number,Category,Item Number,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters)
0,11/04/2015,3717,SUMNER,50674,9.0,1051100.0,54436,750,4.5,6.75,12,81.0,9.0
1,03/02/2016,2614,DAVENPORT,52807,82.0,1011100.0,27605,750,13.75,20.63,2,41.26,1.5
2,02/11/2016,2106,CEDAR FALLS,50613,7.0,1011200.0,19067,1000,12.59,18.89,24,453.36,24.0
3,02/03/2016,2501,AMES,50010,85.0,1071100.0,59154,1750,9.5,14.25,6,85.5,10.5
4,08/18/2015,3654,BELMOND,50421,99.0,1031080.0,35918,1750,7.2,10.8,12,129.6,21.0


**Convert dates**

In [6]:
df.Date = pd.to_datetime(df["Date"])
df.head()

Unnamed: 0,Date,Store Number,City,Zip Code,County Number,Category,Item Number,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters)
0,2015-11-04,3717,SUMNER,50674,9.0,1051100.0,54436,750,4.5,6.75,12,81.0,9.0
1,2016-03-02,2614,DAVENPORT,52807,82.0,1011100.0,27605,750,13.75,20.63,2,41.26,1.5
2,2016-02-11,2106,CEDAR FALLS,50613,7.0,1011200.0,19067,1000,12.59,18.89,24,453.36,24.0
3,2016-02-03,2501,AMES,50010,85.0,1071100.0,59154,1750,9.5,14.25,6,85.5,10.5
4,2015-08-18,3654,BELMOND,50421,99.0,1031080.0,35918,1750,7.2,10.8,12,129.6,21.0
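
The dates in this file look like month/day/year strings. As a hedged sketch, passing an explicit `format` to `pd.to_datetime` removes any ambiguity about day-first versus month-first parsing and tends to be faster on large columns. The sample strings below are copied from the first rows shown above.

```python
import pandas as pd

# Two date strings taken from the head of the dataset.
raw_dates = pd.Series(["11/04/2015", "03/02/2016"])

# %m/%d/%Y tells pandas the strings are month/day/four-digit-year.
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y")
```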


**Drop or replace bad values and convert to integers**

In [7]:
df.dropna(inplace=True)
df['County Number'] = df['County Number'].astype(int)
df['Category'] = df['Category'].astype(int)
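
Dropping every row with a missing value is the simplest option, but it discards otherwise usable transactions. The sketch below compares it with pandas' nullable integer dtype (`"Int64"`, available in pandas 0.24 and later), which keeps the rows. The `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"County Number": [9.0, np.nan, 82.0]})

# Option 1 (what the cell above does): drop incomplete rows, then cast.
dropped = toy.dropna().copy()
dropped["County Number"] = dropped["County Number"].astype(int)

# Option 2: keep every row by casting to the nullable integer dtype.
kept = toy.copy()
kept["County Number"] = kept["County Number"].astype("Int64")
```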

**One zip code was entered as a partial number, so I looked up the correct zip code and replaced it**

In [8]:
df = df.replace('712-2', '51529')
df['Zip Code'] = df['Zip Code'].astype(int)
df.dtypes

Date                    datetime64[ns]
Store Number                     int64
City                            object
Zip Code                         int64
County Number                    int64
Category                         int64
Item Number                      int64
Bottle Volume (ml)               int64
State Bottle Cost              float64
State Bottle Retail            float64
Bottles Sold                     int64
Sale (Dollars)                 float64
Volume Sold (Liters)           float64
dtype: object
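
One way to find entries like this is to look for zip codes that are not purely digits before casting. The sketch below also scopes the replacement to the `Zip Code` column, so the same string appearing in another column would be left alone. The `toy` frame is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"Zip Code": ["50674", "712-2", "52807"]})

# Rows whose zip code contains anything other than digits.
bad = toy[~toy["Zip Code"].str.isdigit()]

# Replace only within the Zip Code column, then convert to integers.
toy["Zip Code"] = toy["Zip Code"].replace("712-2", "51529").astype(int)
```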

## Filter the Data

Some stores may have opened or closed in 2015. These data points will heavily skew our models, so we need to filter them out or find a way to deal with them.

You'll need to provide a summary in your project report about these data points. You may also consider using the monthly sales in your model and including other information (number of months or days each store is open) in your data to handle these unusual cases.
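
As a rough sketch of the monthly-sales idea, the code below totals sales per store per calendar month and counts the months in which each store recorded at least one sale. The `toy` frame and the names `monthly` and `months_active` are illustrative assumptions, not part of the lab.

```python
import pandas as pd

toy = pd.DataFrame({
    "Store Number": [3717, 3717, 3717, 2614],
    "Date": pd.to_datetime(
        ["2015-01-07", "2015-01-21", "2015-02-18", "2015-01-05"]
    ),
    "Sale (Dollars)": [25.19, 30.00, 17.82, 41.26],
})

# One row per store, one column per month, total sales in each cell.
monthly = (
    toy.groupby(["Store Number", toy["Date"].dt.to_period("M")])["Sale (Dollars)"]
       .sum()
       .unstack(fill_value=0)
)

# Number of months with at least one recorded sale, per store.
months_active = (monthly > 0).sum(axis=1)
```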

Let's record the first and last sales dates for each store. We'll save this information for later when we fit our models.

**Added 'First Date' and 'Last Date' columns**

In [9]:
dates = pd.pivot_table(df, index="Store Number", values="Date", aggfunc=(min, max))
df = df.merge(dates, left_on='Store Number', right_index=True)
df

Unnamed: 0,Date,Store Number,City,Zip Code,County Number,Category,Item Number,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),min,max
0,2015-11-04,3717,SUMNER,50674,9,1051100,54436,750,4.50,6.75,12,81.00,9.00,2015-01-07,2016-03-30
15,2015-06-10,3717,SUMNER,50674,9,1051100,54436,750,4.50,6.75,4,27.00,3.00,2015-01-07,2016-03-30
3292,2015-01-21,3717,SUMNER,50674,9,1081900,75211,750,10.00,15.00,2,30.00,1.50,2015-01-07,2016-03-30
3399,2015-11-11,3717,SUMNER,50674,9,1012100,11777,1000,6.63,9.95,2,19.90,2.00,2015-01-07,2016-03-30
3691,2015-01-07,3717,SUMNER,50674,9,1011200,19476,750,16.79,25.19,1,25.19,0.75,2015-01-07,2016-03-30
3778,2015-02-18,3717,SUMNER,50674,9,1012300,15621,600,11.88,17.82,1,17.82,0.60,2015-01-07,2016-03-30
3804,2016-03-23,3717,SUMNER,50674,9,1011100,25607,1000,8.00,12.00,2,24.00,2.00,2015-01-07,2016-03-30
4073,2015-08-19,3717,SUMNER,50674,9,1062300,44520,750,6.83,10.25,1,10.25,0.75,2015-01-07,2016-03-30
5611,2015-12-09,3717,SUMNER,50674,9,1012100,11296,750,15.00,22.50,2,45.00,1.50,2015-01-07,2016-03-30
6700,2016-02-10,3717,SUMNER,50674,9,1081030,67527,1000,15.49,23.24,1,23.24,1.00,2015-01-07,2016-03-30


In [10]:
df['First Date'] = df['min']
df['Last Date'] = df['max']

In [11]:
df.drop('min', axis=1, inplace=True)
df.drop('max', axis=1, inplace=True)
df

Unnamed: 0,Date,Store Number,City,Zip Code,County Number,Category,Item Number,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),First Date,Last Date
0,2015-11-04,3717,SUMNER,50674,9,1051100,54436,750,4.50,6.75,12,81.00,9.00,2015-01-07,2016-03-30
15,2015-06-10,3717,SUMNER,50674,9,1051100,54436,750,4.50,6.75,4,27.00,3.00,2015-01-07,2016-03-30
3292,2015-01-21,3717,SUMNER,50674,9,1081900,75211,750,10.00,15.00,2,30.00,1.50,2015-01-07,2016-03-30
3399,2015-11-11,3717,SUMNER,50674,9,1012100,11777,1000,6.63,9.95,2,19.90,2.00,2015-01-07,2016-03-30
3691,2015-01-07,3717,SUMNER,50674,9,1011200,19476,750,16.79,25.19,1,25.19,0.75,2015-01-07,2016-03-30
3778,2015-02-18,3717,SUMNER,50674,9,1012300,15621,600,11.88,17.82,1,17.82,0.60,2015-01-07,2016-03-30
3804,2016-03-23,3717,SUMNER,50674,9,1011100,25607,1000,8.00,12.00,2,24.00,2.00,2015-01-07,2016-03-30
4073,2015-08-19,3717,SUMNER,50674,9,1062300,44520,750,6.83,10.25,1,10.25,0.75,2015-01-07,2016-03-30
5611,2015-12-09,3717,SUMNER,50674,9,1012100,11296,750,15.00,22.50,2,45.00,1.50,2015-01-07,2016-03-30
6700,2016-02-10,3717,SUMNER,50674,9,1081030,67527,1000,15.49,23.24,1,23.24,1.00,2015-01-07,2016-03-30
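
The pivot, merge, copy, and drop steps above can also be written more compactly. This is an alternative sketch, not the approach used in this notebook: `groupby(...).transform` returns a value for every original row, so the new columns can be assigned directly. The `toy` frame is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Store Number": [3717, 3717, 3717, 2614],
    "Date": pd.to_datetime(
        ["2015-11-04", "2015-01-07", "2016-03-23", "2016-03-02"]
    ),
})

# transform broadcasts each store's min/max date back onto its rows.
dates_by_store = toy.groupby("Store Number")["Date"]
toy["First Date"] = dates_by_store.transform("min")
toy["Last Date"] = dates_by_store.transform("max")
```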


**Made store_info dataframe with City, County Number, and Zip Code**

In [12]:
store_info = pd.pivot_table(df, index=['Store Number', 'City'], values=['Zip Code', "County Number"])
store_info.reset_index(inplace=True)
store_info.set_index('Store Number', inplace=True)
store_info

Store Number,City,County Number,Zip Code
2106,CEDAR FALLS,7,50613
2113,GOWRIE,94,50543
2130,WATERLOO,7,50703
2152,ROCKWELL,17,50469
2178,WAUKON,3,52172
2190,DES MOINES,77,50314
2191,KEOKUK,56,52632
2200,SAC CITY,81,50583
2205,CLARINDA,73,51632
2228,WINTERSET,61,50273


## Compute New Columns and Tables

Since we're trying to predict sales and/or profits, we'll want to compute some intermediate data. There are a lot of ways to do this, and good use of pandas is crucial. For example, for each transaction we may want to know:
* margin: retail price minus bottle cost
* price per bottle
* price per liter

We'll need to make a new dataframe that indexes quantities by store:
* sales per store for all of 2015
* sales per store for Q1 2015
* sales per store for Q1 2016
* total volumes sold
* mean transaction revenue, gross margin, price per bottle, price per liter, etc.
* average sales per day
* number of days open
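
Two of these quantities, number of days open and average sales per day, are not computed in the cells that follow. Here is a minimal sketch, assuming "days open" can be approximated by the span between a store's first and last recorded sale. It uses named aggregation, which requires pandas 0.25 or later. The `toy` frame and the output column names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Store Number": [3717, 3717, 2614],
    "Date": pd.to_datetime(["2015-01-07", "2015-01-16", "2015-03-02"]),
    "Sale (Dollars)": [40.0, 60.0, 41.26],
})

per_store = toy.groupby("Store Number").agg(
    total_sales=("Sale (Dollars)", "sum"),
    first_date=("Date", "min"),
    last_date=("Date", "max"),
)

# Assumption: days open = calendar days from first to last sale, inclusive.
per_store["days_open"] = (
    per_store["last_date"] - per_store["first_date"]
).dt.days + 1
per_store["avg_sales_per_day"] = per_store["total_sales"] / per_store["days_open"]
```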

Make sure to retain other variables that we'll want to use to build our models, such as zip code, county number, city, etc. We recommend that you spend some time thinking about the model you may want to fit and computing enough of the suggested quantities to give you a few options.

Bonus tasks:
* Restrict your attention to stores that were open for all of 2015 and Q1 2016. Stores that opened or closed in 2015 will introduce outliers into your data.
* For each transaction we have the item category. You may be able to determine the store type (primarily wine, liquor, all types of alcohol, etc.) by the most common transaction category for each store. This could be a useful categorical variable for modelling. 
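
For the second bonus task, one possible sketch is to take the most frequent category code per store. Mapping that code to a store type (wine, liquor, and so on) would still need a lookup that is not part of this lab. The `toy` frame is hypothetical; on ties, `mode()` returns values in sorted order, so the smallest code wins.

```python
import pandas as pd

toy = pd.DataFrame({
    "Store Number": [3717, 3717, 3717, 2614, 2614],
    "Category": [1051100, 1051100, 1011200, 1031080, 1031080],
})

# Most common transaction category for each store.
top_category = toy.groupby("Store Number")["Category"].agg(
    lambda s: s.mode().iloc[0]
)
```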

**Added 'Margin' and 'Price per Liter' columns**

In [13]:
# Margin and Price per liter
df['Margin'] = (df['State Bottle Retail'] - df['State Bottle Cost'])*df['Bottles Sold']
df['Price per Liter'] = df['Sale (Dollars)'] / df['Volume Sold (Liters)']

**Filtered out stores that opened or closed during our timeframe**

In [14]:
df = df[(df['First Date'] < pd.Timestamp("2015-01-19")) & (df['Last Date'] > pd.Timestamp("2016-03-17"))]

**Sales by store for 2015**

In [20]:
# Created with pivot tables
sales2015 = pd.pivot_table(df[(df['Date'] >= pd.Timestamp("2015-01-01")) & (df['Date'] <= pd.Timestamp("2015-12-31"))], 
                           index='Store Number',
                           aggfunc={
                                    'Bottles Sold': sum,
                                    'State Bottle Retail': np.mean,
                                    'Volume Sold (Liters)': {'Total Volume Sold': sum, 'Average Volume Sold': np.mean},
                                    'Margin': np.mean,
                                    'Price per Liter': np.mean,
                                    'Sale (Dollars)': {'TotalSales (Dollars)': sum, 
                                                       'AvgSaleRevenue': np.mean, 
                                                       'TotalTransactions': len}
                                    }
                            )
# Removed multi-index
sales2015.columns = [' '.join(col).strip() for col in sales2015.columns.values]
# Merged store_info dataframe and renamed columns
sales2015 = sales2015.merge(store_info, how='left', left_index=True, right_index=True)
sales2015.rename(columns={'Sale (Dollars) AvgSaleRevenue': 'Average Sale (Dollars)', 
                          'Sale (Dollars) TotalSales (Dollars)': 'Total Sales (Dollars)',
                          'Sale (Dollars) TotalTransactions': 'Total Transactions',
                          'State Bottle Retail mean': 'Average Price per Bottle (Dollars)',
                          'Price per Liter mean': 'Average Price per Liter (Dollars)',
                          'Bottles Sold sum': 'Total Bottles Sold',
                          'Volume Sold (Liters) Average Volume Sold': 'Average Volume per Sale (Liters)',
                          'Volume Sold (Liters) Total Volume Sold': 'Total Volume Sold (Liters)',
                          'Margin mean': 'Average Margin'}, inplace=True)
sales2015

Store Number,Average Sale (Dollars),Total Sales (Dollars),Total Transactions,Average Price per Bottle (Dollars),Average Price per Liter (Dollars),Total Bottles Sold,Average Volume per Sale (Liters),Total Volume Sold (Liters),Average Margin,City,County Number,Zip Code
2106,277.658861,146326.22,527.0,15.475863,17.856601,10367,18.466509,9731.85,92.671879,CEDAR FALLS,7,50613
2113,63.334830,9310.22,147.0,16.315646,18.507700,671,4.488776,659.85,21.149932,GOWRIE,94,50543
2130,285.386301,111871.43,392.0,14.764286,16.835809,7430,17.580026,6891.37,95.217347,WATERLOO,7,50703
2178,102.633671,24324.18,237.0,14.558692,16.053844,1928,8.089114,1917.12,34.454430,WAUKON,3,52172
2190,92.539209,121689.06,1315.0,17.289726,23.306227,11111,4.807734,6322.17,30.888008,DES MOINES,77,50314
2191,209.888406,125093.49,596.0,17.327265,19.064053,7696,13.512282,8053.32,70.040923,KEOKUK,56,52632
2200,56.604342,22811.55,403.0,16.828337,16.706969,1668,4.509280,1817.24,18.991241,SAC CITY,81,50583
2228,72.758625,17462.07,240.0,14.613458,17.890101,1312,5.698542,1367.65,24.351417,WINTERSET,61,50273
2233,122.627967,29553.34,241.0,17.535021,19.945082,2253,10.293320,2480.69,40.940498,SPIRIT LAKE,30,51360
2248,147.134913,67682.06,460.0,22.581913,28.818620,3401,6.120326,2815.35,49.096761,DES MOINES,77,50312
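
The nested-dictionary renaming used in `aggfunc` above relies on older pandas behavior; pandas 1.0 and later reject it with a `SpecificationError`. The sketch below shows one possible equivalent using `groupby` with named aggregation (pandas 0.25 or later), wrapped in a hypothetical helper, `summarize_by_store`, so the same logic can serve each date window. The `toy` frame is illustrative and only a subset of the summary columns is reproduced.

```python
import pandas as pd

def summarize_by_store(df, start, end):
    """Aggregate transactions dated from start to end (inclusive) per store."""
    window = df[(df["Date"] >= pd.Timestamp(start)) &
                (df["Date"] <= pd.Timestamp(end))]
    # Output column name -> (source column, aggregation function).
    return window.groupby("Store Number").agg(**{
        "Total Sales (Dollars)": ("Sale (Dollars)", "sum"),
        "Average Sale (Dollars)": ("Sale (Dollars)", "mean"),
        # "count" tallies non-null sale values, i.e. transactions.
        "Total Transactions": ("Sale (Dollars)", "count"),
        "Total Bottles Sold": ("Bottles Sold", "sum"),
        "Total Volume Sold (Liters)": ("Volume Sold (Liters)", "sum"),
        "Average Margin": ("Margin", "mean"),
    })

toy = pd.DataFrame({
    "Store Number": [3717, 3717, 3717],
    "Date": pd.to_datetime(["2015-01-07", "2015-11-04", "2016-03-23"]),
    "Sale (Dollars)": [25.19, 81.00, 24.00],
    "Bottles Sold": [1, 12, 2],
    "Volume Sold (Liters)": [0.75, 9.0, 2.0],
    "Margin": [8.40, 27.00, 8.00],
})

sales_2015 = summarize_by_store(toy, "2015-01-01", "2015-12-31")
sales_q1_2016 = summarize_by_store(toy, "2016-01-01", "2016-03-31")
```

Merging `store_info` onto the result would work the same way as in the cell above.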


**Sales by store for Q1 of 2015**

In [16]:
salesQ115 = pd.pivot_table(df[(df['Date'] >= pd.Timestamp("2015-01-01")) & (df['Date'] <= pd.Timestamp("2015-03-31"))], 
                           index='Store Number',
                           aggfunc={
                                    'Bottles Sold': sum,
                                    'State Bottle Retail': np.mean,
                                    'Volume Sold (Liters)': {'Total Volume Sold': sum, 'Average Volume Sold': np.mean},
                                    'Margin': np.mean,
                                    'Price per Liter': np.mean,
                                    'Sale (Dollars)': {'TotalSales (Dollars)': sum, 
                                                       'AvgSaleRevenue': np.mean, 
                                                       'TotalTransactions': len}
                                    }
                            )
salesQ115.columns = [' '.join(col).strip() for col in salesQ115.columns.values]
salesQ115 = salesQ115.merge(store_info, how='left', left_index=True, right_index=True)
salesQ115.rename(columns={'Sale (Dollars) AvgSaleRevenue': 'Average Sale (Dollars)', 
                          'Sale (Dollars) TotalSales (Dollars)': 'Total Sales (Dollars)',
                          'Sale (Dollars) TotalTransactions': 'Total Transactions',
                          'State Bottle Retail mean': 'Average Price per Bottle (Dollars)',
                          'Price per Liter mean': 'Average Price per Liter (Dollars)',
                          'Bottles Sold sum': 'Total Bottles Sold',
                          'Volume Sold (Liters) Average Volume Sold': 'Average Volume per Sale (Liters)',
                          'Volume Sold (Liters) Total Volume Sold': 'Total Volume Sold (Liters)',
                          'Margin mean': 'Average Margin'}, inplace=True)
salesQ115

Store Number,Average Sale (Dollars),Total Sales (Dollars),Total Transactions,Average Price per Bottle (Dollars),Average Price per Liter (Dollars),Total Bottles Sold,Average Volume per Sale (Liters),Total Volume Sold (Liters),Average Margin,City,County Number,Zip Code
2106,304.552636,39287.29,129.0,15.075271,17.846608,2705,19.582171,2526.10,101.615271,CEDAR FALLS,7,50613
2113,67.458333,2833.25,42.0,15.821190,19.365167,196,4.216905,177.11,22.493333,GOWRIE,94,50543
2130,278.995057,24272.57,87.0,15.401379,17.565430,1533,16.635057,1447.25,93.203218,WATERLOO,7,50703
2178,122.008542,5856.41,48.0,14.748542,16.705494,490,8.537708,409.81,40.860000,WAUKON,3,52172
2190,84.878732,29452.92,347.0,16.376282,21.932514,2557,4.802824,1666.58,28.323631,DES MOINES,77,50314
2191,192.619669,29085.57,151.0,17.453974,18.591907,1868,12.962119,1957.28,64.393377,KEOKUK,56,52632
2200,58.338452,4900.43,84.0,17.595357,16.791219,338,4.377619,367.72,19.545238,SAC CITY,81,50583
2228,86.566167,5193.97,60.0,14.606833,17.102040,372,6.760333,405.62,28.940500,WINTERSET,61,50273
2233,124.989535,5374.55,43.0,17.556512,18.350133,368,9.763953,419.85,41.767674,SPIRIT LAKE,30,51360
2248,125.382970,12663.68,101.0,22.087327,26.738835,665,6.431089,649.54,41.853861,DES MOINES,77,50312


**Sales by store for Q1 of 2016**

In [17]:
salesQ116 = pd.pivot_table(df[(df['Date'] >= pd.Timestamp("2016-01-01")) & (df['Date'] <= pd.Timestamp("2016-03-31"))], 
                           index='Store Number',
                           aggfunc={
                                    'Bottles Sold': sum,
                                    'State Bottle Retail': np.mean,
                                    'Volume Sold (Liters)': {'Total Volume Sold': sum, 'Average Volume Sold': np.mean},
                                    'Margin': np.mean,
                                    'Price per Liter': np.mean,
                                    'Sale (Dollars)': {'TotalSales (Dollars)': sum, 
                                                       'AvgSaleRevenue': np.mean, 
                                                       'TotalTransactions': len}
                                    }
                            )
salesQ116.columns = [' '.join(col).strip() for col in salesQ116.columns.values]
salesQ116 = salesQ116.merge(store_info, how='left', left_index=True, right_index=True)
salesQ116.rename(columns={'Sale (Dollars) AvgSaleRevenue': 'Average Sale (Dollars)', 
                          'Sale (Dollars) TotalSales (Dollars)': 'Total Sales (Dollars)',
                          'Sale (Dollars) TotalTransactions': 'Total Transactions',
                          'State Bottle Retail mean': 'Average Price per Bottle (Dollars)',
                          'Price per Liter mean': 'Average Price per Liter (Dollars)',
                          'Bottles Sold sum': 'Total Bottles Sold',
                          'Volume Sold (Liters) Average Volume Sold': 'Average Volume per Sale (Liters)',
                          'Volume Sold (Liters) Total Volume Sold': 'Total Volume Sold (Liters)',
                          'Margin mean': 'Average Margin'}, inplace=True)
salesQ116

Store Number,Average Sale (Dollars),Total Sales (Dollars),Total Transactions,Average Price per Bottle (Dollars),Average Price per Liter (Dollars),Total Bottles Sold,Average Volume per Sale (Liters),Total Volume Sold (Liters),Average Margin,City,County Number,Zip Code
2106,240.344488,30523.75,127.0,15.614567,18.064496,2220,16.675197,2117.75,80.233701,CEDAR FALLS,7,50613
2113,55.835135,2065.90,37.0,16.077297,17.483024,159,4.783784,177.00,18.742973,GOWRIE,94,50543
2130,238.086410,27856.11,117.0,15.932308,17.452157,1726,13.306838,1556.90,79.387094,WATERLOO,7,50703
2178,96.353448,5588.50,58.0,13.915345,15.101929,480,8.979310,520.80,32.220000,WAUKON,3,52172
2190,110.982926,34515.69,311.0,19.250836,26.651232,2567,4.879293,1517.46,37.007331,DES MOINES,77,50314
2191,331.036364,47338.20,143.0,17.015245,18.074705,2610,20.476364,2928.12,110.381329,KEOKUK,56,52632
2200,50.913049,4174.87,82.0,17.038780,17.534613,262,3.887561,318.78,16.975854,SAC CITY,81,50583
2228,46.763333,3086.38,66.0,15.306667,19.542326,229,3.164394,208.85,15.662727,WINTERSET,61,50273
2233,133.115306,6522.65,49.0,22.361020,24.754852,478,9.747347,477.62,44.402653,SPIRIT LAKE,30,51360
2248,145.241800,14524.18,100.0,22.615500,29.831242,714,5.627600,562.76,48.452100,DES MOINES,77,50312


Proceed with any calculations that you need for your models, such as grouping
sales by zip code, most common vendor number per store, etc. Once you have finished adding columns, be sure to save the dataframe.
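
As one example of such a calculation, here is a minimal sketch that totals sales by zip code and sorts the result from highest to lowest. The `toy` frame and the name `sales_by_zip` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Zip Code": [50674, 50674, 52807],
    "Sale (Dollars)": [81.00, 27.00, 41.26],
})

# Total sales per zip code, largest first.
sales_by_zip = (
    toy.groupby("Zip Code")["Sale (Dollars)"]
       .sum()
       .sort_values(ascending=False)
)
```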

In [19]:
sales.to_csv("sales.csv")
