# Capstone Two - Pre-processing and Training Data Development

### To complete this step, you'll do the following:

- Create dummy or indicator features for categorical variables
- Standardize the magnitude of numeric features using a scaler
- Split your data into testing and training datasets

### Creation of indicator features for categorical variables

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

In [2]:
#Read the csv file from the Data Wrangling step
df = pd.read_csv('walmart_sales-Copy1.csv', index_col = 0)

In [3]:
#See what the df looks like
df.head()

Unnamed: 0,Store,Dept,Date,Weekly_Sales,IsHoliday,Temperature,Fuel_Price,CPI,Unemployment,Close,Type,Size
0,1,1,2010-02-05,24924.5,False,42.31,2.572,211.096358,8.106,53.450001,A,151315
1,1,2,2010-02-05,50605.27,False,42.31,2.572,211.096358,8.106,53.450001,A,151315
2,1,3,2010-02-05,13740.12,False,42.31,2.572,211.096358,8.106,53.450001,A,151315
3,1,4,2010-02-05,39954.04,False,42.31,2.572,211.096358,8.106,53.450001,A,151315
4,1,5,2010-02-05,32229.38,False,42.31,2.572,211.096358,8.106,53.450001,A,151315


In [4]:
#Select the categorical features (Date and Weekly_Sales are carried along for now)
df_dummies = df[['Date','Weekly_Sales','IsHoliday','Store', 'Dept', 'Type']]

In [5]:
# Convert the categorical columns into dummy variables with the pandas get_dummies function
df_dummies = pd.get_dummies(df_dummies, columns = ['IsHoliday','Store', 'Dept', 'Type'])

In [6]:
#See what the data frame with categorical features looks like
df_dummies.head()

Unnamed: 0,Date,Weekly_Sales,IsHoliday_False,IsHoliday_True,Store_1,Store_2,Store_3,Store_4,Store_5,Store_6,...,Dept_93,Dept_94,Dept_95,Dept_96,Dept_97,Dept_98,Dept_99,Type_A,Type_B,Type_C
0,2010-02-05,24924.5,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,2010-02-05,50605.27,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,2010-02-05,13740.12,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,2010-02-05,39954.04,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,2010-02-05,32229.38,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
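
Keeping every level of each categorical feature makes the indicator columns redundant with one another (for example, `IsHoliday_False` is always `1 - IsHoliday_True`), which can matter for linear models. The sketch below shows, on a made-up three-row frame, how `drop_first=True` removes one level per feature. Whether to use it here depends on the model chosen later; `toy`, `full`, and `reduced` are hypothetical names.

```python
import pandas as pd

# Toy frame (hypothetical values) with the same categorical column names as above
toy = pd.DataFrame({
    "IsHoliday": [False, True, False],
    "Type": ["A", "B", "C"],
})

# Full one-hot encoding: one column per level
full = pd.get_dummies(toy, columns=["IsHoliday", "Type"])

# Reduced encoding: the first level of each feature is dropped
reduced = pd.get_dummies(toy, columns=["IsHoliday", "Type"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```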


In [7]:
#Keep only the dummy columns for the later concatenation
df_dummies = df_dummies.drop(columns = ['Date', 'Weekly_Sales'])

In [8]:
#Keep the sales and the date columns for the later concatenation
df_sales = df['Weekly_Sales']
df_date = df['Date']

### Standardize the magnitude of numeric features using a scaler

In [9]:
#Extract the numeric features from the main data frame
df_numeric = df[['Date', 'Weekly_Sales', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'Close']]

In [10]:
#See first rows of the data
df_numeric.head()

Unnamed: 0,Date,Weekly_Sales,Temperature,Fuel_Price,CPI,Unemployment,Close
0,2010-02-05,24924.5,42.31,2.572,211.096358,8.106,53.450001
1,2010-02-05,50605.27,42.31,2.572,211.096358,8.106,53.450001
2,2010-02-05,13740.12,42.31,2.572,211.096358,8.106,53.450001
3,2010-02-05,39954.04,42.31,2.572,211.096358,8.106,53.450001
4,2010-02-05,32229.38,42.31,2.572,211.096358,8.106,53.450001


In [11]:
#Import necessary libraries
from sklearn.preprocessing import StandardScaler

In [12]:
#Create X and y variables
X = df_numeric.drop(columns = ['Weekly_Sales', 'Date'])
y = df_numeric['Weekly_Sales']

In [13]:
# Making a Scaler object
scaler = StandardScaler()
# Fitting data to the scaler object
scaler.fit(X)
#Transform the data
scaler_transformed = scaler.transform(X)

In [14]:
#Create a data frame with the scaled data, reusing the column names of X
df_scaled = pd.DataFrame(scaler_transformed, columns = X.columns)

In [15]:
#See what the scaled data looks like
df_scaled.head()

Unnamed: 0,Temperature,Fuel_Price,CPI,Unemployment,Close
0,-0.973901,-1.724658,1.017898,0.080282,-0.578806
1,-0.973901,-1.724658,1.017898,0.080282,-0.578806
2,-0.973901,-1.724658,1.017898,0.080282,-0.578806
3,-0.973901,-1.724658,1.017898,0.080282,-0.578806
4,-0.973901,-1.724658,1.017898,0.080282,-0.578806
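
The cells above fit the scaler on the full dataset before it is split. A common alternative is to split first and fit the scaler on the training rows only, so that the test rows do not influence the mean and standard deviation. The sketch below illustrates that order on synthetic data; `X_demo`, `X_tr`, `X_te`, and `scaler_demo` are hypothetical names, and the values are random stand-ins rather than the Walmart features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric features (hypothetical values)
rng = np.random.default_rng(123)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

# Split first, so the test rows never influence the scaling statistics
X_tr, X_te = train_test_split(X_demo, test_size=0.25, random_state=123)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn mean/std on train only
X_te_scaled = scaler_demo.transform(X_te)      # reuse the train statistics

print(X_tr_scaled.mean(axis=0).round(6))  # approximately zero per column
print(X_tr_scaled.std(axis=0).round(6))   # approximately one per column
```

The scaled test columns will generally not have exactly zero mean or unit variance, which is expected because they are transformed with the training statistics.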


In [16]:
#Concatenate the date, the sales, the scaled numeric features and the dummy features
df = pd.concat([df_date, df_sales, df_scaled, df_dummies], axis = 1)

In [17]:
#See what the final data frame looks like
df.head()

Unnamed: 0,Date,Weekly_Sales,Temperature,Fuel_Price,CPI,Unemployment,Close,IsHoliday_False,IsHoliday_True,Store_1,...,Dept_93,Dept_94,Dept_95,Dept_96,Dept_97,Dept_98,Dept_99,Type_A,Type_B,Type_C
0,2010-02-05,24924.5,-0.973901,-1.724658,1.017898,0.080282,-0.578806,1,0,1,...,0,0,0,0,0,0,0,1,0,0
1,2010-02-05,50605.27,-0.973901,-1.724658,1.017898,0.080282,-0.578806,1,0,1,...,0,0,0,0,0,0,0,1,0,0
2,2010-02-05,13740.12,-0.973901,-1.724658,1.017898,0.080282,-0.578806,1,0,1,...,0,0,0,0,0,0,0,1,0,0
3,2010-02-05,39954.04,-0.973901,-1.724658,1.017898,0.080282,-0.578806,1,0,1,...,0,0,0,0,0,0,0,1,0,0
4,2010-02-05,32229.38,-0.973901,-1.724658,1.017898,0.080282,-0.578806,1,0,1,...,0,0,0,0,0,0,0,1,0,0


### Split your data into testing and training datasets

In [18]:
#Import necessary libraries
from sklearn.model_selection import train_test_split

In [19]:
#Define X (all features) and y (the date and the weekly sales)
X = df.drop(columns = ['Date', 'Weekly_Sales']).values
y = df[['Date', 'Weekly_Sales']].values

In [20]:
#Divide into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 123)
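
The split above works on NumPy arrays obtained with `.values`, which is why the column names are saved and the dtypes restored in the cells that follow. As a hedged alternative, `train_test_split` also accepts a DataFrame directly and returns DataFrames, keeping names and dtypes intact. The sketch below uses a made-up four-row frame; `toy`, `train_demo`, and `test_demo` are hypothetical names and the numbers are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (hypothetical values) mirroring the layout of `df`
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-19", "2010-02-26"]),
    "Weekly_Sales": [100.0, 200.0, 300.0, 400.0],
    "Temperature": [-0.97, -1.10, -1.02, -0.70],
    "Type_A": [1, 1, 0, 0],
})

# train_test_split accepts DataFrames and returns DataFrames
train_demo, test_demo = train_test_split(toy, test_size=0.25, random_state=123)

# Reset the shuffled index so later concatenations align row by row
train_demo = train_demo.reset_index(drop=True)
test_demo = test_demo.reset_index(drop=True)

print(train_demo.dtypes)
```

With this approach the `Date` column stays a datetime and `Weekly_Sales` stays a float, so no conversion step is needed afterwards.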

In [21]:
#Save the column names of the features and of the target
column_names = df.drop(columns = ['Date', 'Weekly_Sales']).columns
y_columns = df[['Date', 'Weekly_Sales']].columns

In [22]:
#Creation of the train data frame
df_train1 = pd.DataFrame(X_train, columns = column_names)
df_train2 = pd.DataFrame(y_train, columns = y_columns)
df_train = pd.concat([df_train2, df_train1], axis = 1)

In [23]:
#See the first 5 rows of our new train data frame
df_train.head()

Unnamed: 0,Date,Weekly_Sales,Temperature,Fuel_Price,CPI,Unemployment,Close,IsHoliday_False,IsHoliday_True,Store_1,...,Dept_93,Dept_94,Dept_95,Dept_96,Dept_97,Dept_98,Dept_99,Type_A,Type_B,Type_C
0,2012-01-27,20732.8,-1.509118,0.827211,-0.874912,-0.007219,0.420931,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,2012-10-26,12399.0,0.179249,0.338742,-1.022145,-2.188849,2.40388,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,2010-08-27,21588.8,0.550658,-1.323806,0.856888,-0.230536,-0.916183,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,2010-09-03,53505.4,0.20574,-1.284378,0.488524,-0.333605,-0.772969,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,2011-01-21,30911.9,-0.19324,-0.051158,-1.117955,0.422772,-0.264839,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [24]:
#Convert the Date column into date type and the weekly sales into float numbers
df_train['Date'] = pd.to_datetime(df_train['Date'])
df_train['Weekly_Sales'] = df_train['Weekly_Sales'].astype(float)

In [25]:
#Create a new data frame to hold the year, month and day extracted from the date
df_date = pd.DataFrame()

In [26]:
#Convert to the year
df_date['Year'] = pd.Series(df_train['Date'])
df_date['Year'] = pd.DatetimeIndex(df_date['Year']).year

In [27]:
#Convert to the month
df_date['Month'] = pd.Series(df_train['Date'])
df_date['Month'] = pd.DatetimeIndex(df_date['Month']).month

In [28]:
#Convert to the day
df_date['Day'] = pd.Series(df_train['Date'])
df_date['Day'] = pd.DatetimeIndex(df_date['Day']).day

In [29]:
#Concat the train data frame with the categorical data frame 
df_train = pd.concat([df_date, df_train], axis = 1)

In [30]:
#Drop the date column in order to avoid redundancy
df_train.drop('Date', axis = 1, inplace = True)
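
The year, month, and day steps above can be collapsed into one helper built on the pandas `.dt` accessor, which could then be applied to both the train and the test frame. This is a minimal sketch, not the notebook's method; `add_date_parts` and `demo` are hypothetical names introduced for illustration.

```python
import pandas as pd

def add_date_parts(frame, date_col="Date"):
    """Return a copy of `frame` with Year, Month, Day columns replacing `date_col`."""
    dates = pd.to_datetime(frame[date_col])
    parts = pd.DataFrame(
        {"Year": dates.dt.year, "Month": dates.dt.month, "Day": dates.dt.day},
        index=frame.index,
    )
    # Put the date parts first, then every original column except the date
    return pd.concat([parts, frame.drop(columns=[date_col])], axis=1)

# Two rows taken from the train output above
demo = pd.DataFrame({
    "Date": ["2012-01-27", "2010-09-03"],
    "Weekly_Sales": [20732.84, 53505.39],
})
print(add_date_parts(demo))
```

Because the helper returns a new frame and leaves its input untouched, the same call works unchanged for `df_train` and `df_test`.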

In [31]:
#See the final data frame
df_train.head()

Unnamed: 0,Year,Month,Day,Weekly_Sales,Temperature,Fuel_Price,CPI,Unemployment,Close,IsHoliday_False,...,Dept_93,Dept_94,Dept_95,Dept_96,Dept_97,Dept_98,Dept_99,Type_A,Type_B,Type_C
0,2012,1,27,20732.84,-1.509118,0.827211,-0.874912,-0.007219,0.420931,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,2012,10,26,12399.01,0.179249,0.338742,-1.022145,-2.188849,2.40388,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,2010,8,27,21588.82,0.550658,-1.323806,0.856888,-0.230536,-0.916183,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,2010,9,3,53505.39,0.20574,-1.284378,0.488524,-0.333605,-0.772969,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,2011,1,21,30911.87,-0.19324,-0.051158,-1.117955,0.422772,-0.264839,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [32]:
# Convert train to csv
df_train.to_csv('WalmartSalesTrainDataset.csv')
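
`to_csv` writes the row index by default, which is where a column such as `Unnamed: 0` comes from when the file is read back (likely the reason `index_col = 0` was needed when loading the wrangled data). The sketch below compares the default with `index=False` using an in-memory buffer instead of a file; `demo`, `cols_with`, and `cols_without` are hypothetical names.

```python
import io
import pandas as pd

demo = pd.DataFrame({"Year": [2012, 2010], "Weekly_Sales": [20732.84, 53505.39]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'Year', 'Weekly_Sales']
print(cols_without)  # ['Year', 'Weekly_Sales']
```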

In [33]:
# Creation of the test data frame 
df_test1 = pd.DataFrame(X_test, columns = column_names)
df_test2 = pd.DataFrame(y_test, columns = y_columns)
df_test = pd.concat([df_test2, df_test1], axis = 1)

In [34]:
#See first 5 rows
df_test.head()

Unnamed: 0,Date,Weekly_Sales,Temperature,Fuel_Price,CPI,Unemployment,Close,IsHoliday_False,IsHoliday_True,Store_1,...,Dept_93,Dept_94,Dept_95,Dept_96,Dept_97,Dept_98,Dept_99,Type_A,Type_B,Type_C
0,2012-01-20,36.0,-0.077006,-0.184775,1.211405,-0.382455,0.462242,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,2010-03-26,4873.39,-0.014294,-0.60534,-1.139284,3.230868,-0.295134,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,2012-06-22,35979.0,0.572823,0.590643,-0.844196,0.186572,1.328406,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,2010-08-06,1151.25,1.34105,-1.604184,1.11333,-0.3277,-0.807396,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,2011-02-04,8718.66,-2.657403,-0.26144,-0.971329,-0.026545,-0.223528,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [35]:
#Repeat the steps above, this time on the test data
#Convert the Date column into date type and the weekly sales into float numbers
df_test['Date'] = pd.to_datetime(df_test['Date'])
df_test['Weekly_Sales'] = df_test['Weekly_Sales'].astype(float)

In [36]:
#Creation of the new data frame
df_date = pd.DataFrame()

In [37]:
#Year column
df_date['Year'] = pd.Series(df_test['Date'])
df_date['Year'] = pd.DatetimeIndex(df_date['Year']).year

In [38]:
#Month column
df_date['Month'] = pd.Series(df_test['Date'])
df_date['Month'] = pd.DatetimeIndex(df_date['Month']).month

In [39]:
#Day column
df_date['Day'] = pd.Series(df_test['Date'])
df_date['Day'] = pd.DatetimeIndex(df_date['Day']).day

In [40]:
#Concat the date data frame and the test data
df_test = pd.concat([df_date, df_test], axis = 1)

In [41]:
#Drop the date column in order to avoid redundancy
df_test.drop('Date', axis = 1, inplace = True)

In [42]:
#See what the test data looks like
df_test.head()

Unnamed: 0,Year,Month,Day,Weekly_Sales,Temperature,Fuel_Price,CPI,Unemployment,Close,IsHoliday_False,...,Dept_93,Dept_94,Dept_95,Dept_96,Dept_97,Dept_98,Dept_99,Type_A,Type_B,Type_C
0,2012,1,20,36.0,-0.077006,-0.184775,1.211405,-0.382455,0.462242,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,2010,3,26,4873.39,-0.014294,-0.60534,-1.139284,3.230868,-0.295134,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,2012,6,22,35978.97,0.572823,0.590643,-0.844196,0.186572,1.328406,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,2010,8,6,1151.25,1.34105,-1.604184,1.11333,-0.3277,-0.807396,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,2011,2,4,8718.66,-2.657403,-0.26144,-0.971329,-0.026545,-0.223528,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [43]:
# Convert test to csv
df_test.to_csv('WalmartSalesTestDataset.csv')

### Questions

1. Does my data set have any categorical data, such as Gender or day of the week? 

Yes. The dataset has several categorical features: the department number, the store number, the store type, and whether the week is a holiday week.

2. Do my features have data values that range from 0-100 or 0-1 or both and more?

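The features span different ranges. Before scaling, the numeric features sat on very different scales: in the first rows shown above, Fuel_Price is about 2.6, Temperature about 42, and CPI about 211. After standardizing, each feature is centered at zero, so it takes both negative and positive values.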
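
One way to check the ranges directly is to compare the per-column minimum and maximum before and after scaling. The sketch below does this on a small made-up frame; `raw` and `scaled` are hypothetical names, and the values are illustrative rather than taken from the Walmart data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values on two very different scales
raw = pd.DataFrame({
    "Fuel_Price": [2.5, 3.0, 3.5, 4.0],
    "CPI": [130.0, 180.0, 210.0, 220.0],
})

# Range of each column before scaling
print(raw.agg(["min", "max"]))

# Standardize, then wrap the result back into a labelled frame
scaled = pd.DataFrame(StandardScaler().fit_transform(raw), columns=raw.columns)

# Range of each column after scaling: negative minimum, positive maximum
print(scaled.agg(["min", "max"]).round(3))
```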