### Step Two update - Aug 1st

After talking to my supervisor, I want to familiarize myself with the formal procedures of data preprocessing, so I have reorganized the document into its current format.

### Step One update
I am trying to improve my machine learning skills. To that end, I created this notebook to work through a finished competition: https://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather. The data was downloaded from the competition page and re-uploaded. For detailed information about the dataset, refer to the competition page.

This document archives my development progress on this project. More content will be added as the project progresses. Eventually, all the code will be transferred to GitHub.


# Preprocessing

The preprocessing procedure is as follows: data merge, data summary, outlier detection and explanation, data interpolation, and data selection.

## Data Merge

The goal is to predict sales of goods from weather data. The sales records are stored in train.csv and the weather records in weather.csv. key.csv maps each store to its weather station. A logical first step is therefore to merge the information from these datasets.

To merge datasets into one:

In [1]:
import pandas as pd
import numpy as np
import sys
import re
df_key = pd.read_csv("../input/key.csv")
df_train = pd.read_csv("../input/train.csv")
df_test = pd.read_csv("../input/test.csv")
df_weather = pd.read_csv("../input/weather.csv")

df_train['date'] = pd.to_datetime(df_train['date'])
df_weather['date'] = pd.to_datetime(df_weather['date'])

temp = pd.merge(df_train, df_key,how='left', on=['store_nbr'])
df_main_train = pd.merge(temp, df_weather, how='left', on=['station_nbr','date'])

print(df_train.shape)
print(temp.shape)
print(df_main_train.shape)
print(list(df_main_train))

(4617600, 4)
(4617600, 5)
(4617600, 23)
['date', 'store_nbr', 'item_nbr', 'units', 'station_nbr', 'tmax', 'tmin', 'tavg', 'depart', 'dewpoint', 'wetbulb', 'heat', 'cool', 'sunrise', 'sunset', 'codesum', 'snowfall', 'preciptotal', 'stnpressure', 'sealevel', 'resultspeed', 'resultdir', 'avgspeed']


The weather station number is first merged into the sales records by store number; the weather records are then merged in by both station number and date.
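The two-step merge can be checked on a small example. This is a minimal sketch with made-up values standing in for train.csv, key.csv and weather.csv; `validate` and `indicator` are standard `merge` options used here to confirm that no sales row is duplicated or left without weather data.

```python
import pandas as pd

# Toy stand-ins for train.csv, key.csv and weather.csv (values are made up).
sales = pd.DataFrame({
    "date": pd.to_datetime(["2012-01-01", "2012-01-01", "2012-01-02"]),
    "store_nbr": [1, 2, 1],
    "item_nbr": [9, 9, 9],
    "units": [3, 0, 5],
})
key = pd.DataFrame({"store_nbr": [1, 2], "station_nbr": [10, 20]})
weather = pd.DataFrame({
    "station_nbr": [10, 10, 20],
    "date": pd.to_datetime(["2012-01-01", "2012-01-02", "2012-01-01"]),
    "tavg": [30, 28, 45],
})

# Step 1: attach the station number to every sales row.
with_station = sales.merge(key, how="left", on="store_nbr", validate="many_to_one")

# Step 2: attach that station's weather for the same date.
merged = with_station.merge(
    weather, how="left", on=["station_nbr", "date"],
    validate="many_to_one", indicator=True,
)

# Rows flagged 'left_only' would be sales rows without a weather record.
unmatched = int((merged["_merge"] == "left_only").sum())
print(merged.shape, unmatched)
```

Because both merges are left joins on unique right-hand keys, the row count stays equal to that of the sales table, which matches the unchanged 4617600 rows in the output above.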

## Data Summarize

From the output above, we can see that the final dataset contains the following columns:

date: date in year-month-day format
store_nbr: Walmart store number
item_nbr: item number; there are 111 of them, and each number identifies one item. We have no further information about what each item actually is.
units: number of units of the item sold on that day
station_nbr: weather station number
tmax, tmin, tavg, depart, dewpoint, wetbulb: maximum, minimum, and average temperature, departure from normal, average dew point, and average wet bulb, all in degrees Fahrenheit
heat, cool: most likely heating and cooling degree days (base 65 °F), as reported in NOAA daily weather summaries; I have not verified this against the data
sunrise, sunset: times of sunrise and sunset
codesum: letter codes indicating the weather conditions of that day, such as RA for rain and SN for snow
snowfall: snowfall in inches
preciptotal: total precipitation over 24 hours (rain plus melted snow), in inches
stnpressure: average station air pressure, in inches of mercury
sealevel: average sea-level pressure, in inches of mercury
resultspeed: resultant wind speed, in miles per hour
resultdir: resultant wind direction, in degrees
avgspeed: average wind speed, in miles per hour
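If heat and cool are degree days, they should be reproducible from tavg. The following is a minimal sketch under that assumption, using the conventional 65 °F base and made-up temperatures; comparing its output with the real columns would confirm or refute the interpretation.

```python
import pandas as pd

BASE_TEMP_F = 65  # assumed base temperature for degree days

# Made-up daily average temperatures in degrees Fahrenheit.
tavg = pd.Series([30, 65, 80], name="tavg")

# Heating degree days: how far the average falls below the base.
heat = (BASE_TEMP_F - tavg).clip(lower=0)
# Cooling degree days: how far the average rises above the base.
cool = (tavg - BASE_TEMP_F).clip(lower=0)

print(heat.tolist(), cool.tolist())
```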

The following cells are just for getting familiar with the dataset and with pandas in general: plotting item 6 sales per month, precipitation per month, and daily sales against precipitation.

In [None]:
import matplotlib.pyplot as plt

# Add year and month columns, then keep only the rows for item 6.
df = df_main_train
df['year'], df['month'] = df['date'].dt.year, df['date'].dt.month
mask = (df['item_nbr'] == 6)
df = df.loc[mask]

df2 = df[['month','year','units']]

# Total units sold per month, one bar per year.
count2 = df2.groupby(['month','year'])
totalsum = count2['units'].sum().unstack()
# Flattened monthly totals, only needed by the commented-out plot further down.
x = totalsum.values.reshape(-1,1)
totalsum.plot(kind = 'bar', title = 'units')
plt.ylabel('count')
plt.show()

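In [None]:
import matplotlib.pyplot as plt

# .copy() avoids a SettingWithCopyWarning when the column is reassigned below.
df3 = df[['preciptotal','month','year']].copy()
# pd.to_numeric replaces the deprecated convert_objects; non-numeric entries become NaN.
df3['preciptotal'] = pd.to_numeric(df3['preciptotal'], errors='coerce')
# interpolate() returns a new object, so the result has to be assigned back.
df3['preciptotal'] = df3['preciptotal'].interpolate()

# Total precipitation per month, one bar per year.
count3 = df3.groupby(['month','year'])
totalsum = count3['preciptotal'].sum().unstack()
# Flattened monthly totals, only needed by the commented-out plot further down.
y = totalsum.values.reshape(-1,1)
totalsum.plot(kind = 'bar', title = 'preciptotal')
plt.ylabel('inches')
plt.show()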

<font color='red'>convert_objects kept giving me a warning. Are there any alternatives?</font>
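One alternative is `pd.to_numeric` with `errors='coerce'`, which converts a single column and turns anything it cannot parse into NaN. A minimal sketch on a made-up sample follows; the placeholder codes 'M' and 'T' are assumptions about how missing and trace values appear in the raw file.

```python
import pandas as pd

# Made-up sample: 'M' (missing) and 'T' (trace) are assumed placeholder codes.
raw = pd.Series(["0.10", "M", "T", "1.25"], name="preciptotal")

# errors='coerce' turns anything unparseable into NaN instead of raising.
clean = pd.to_numeric(raw, errors="coerce")

print(clean.tolist())
```

For several columns at once, `df[cols].apply(pd.to_numeric, errors='coerce')` does the same column by column.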

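In [None]:
#plt.plot(x, y, 'o', label="data")

# Daily units of item 6 against daily precipitation (same rows, same order).
x1 = df2['units'].values.reshape(-1,1)
y1 = df3['preciptotal'].values.reshape(-1,1)

plt.plot(x1, y1, 'o', label="data")
plt.xlabel('units')
plt.ylabel('preciptotal')
plt.show()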

## Weather Event Locate & Data interpolation

Highlight the data for weather events, which are defined as rainy days with more than 1 inch of rainfall or snowy days with more than 2 inches of snowfall.

<font color='red'>Problem: The goal here is to focus not only on event days, but also on the 3 days before and after each event day. What would be a convenient way to mark those days?</font>
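One possible way to mark those days is to expand every event date into a 7-day window per station and merge the result back. The sketch below is only an illustration: `mark_event_window` is a hypothetical helper, the toy data is made up, and it assumes columns named `station_nbr`, `date` and a boolean `Condition` like the ones used in this notebook.

```python
import pandas as pd

def mark_event_window(df, event_col="Condition", days=3):
    """Return a boolean Series: True within `days` of an event at the same station."""
    events = df.loc[df[event_col], ["station_nbr", "date"]].drop_duplicates()
    # Shift every event date by -days..+days to build the surrounding window.
    window = pd.concat(
        [events.assign(date=events["date"] + pd.Timedelta(days=k))
         for k in range(-days, days + 1)]
    ).drop_duplicates().assign(in_window=True)
    # A left merge keeps the row order of df; unmatched rows get NaN.
    flagged = df[["station_nbr", "date"]].merge(
        window, how="left", on=["station_nbr", "date"]
    )
    return pd.Series(flagged["in_window"].notna().to_numpy(), index=df.index)

# Toy data: two stations, ten days each, one event at station 1 on Jan 5.
dates = pd.date_range("2012-01-01", periods=10, freq="D")
toy = pd.DataFrame({"station_nbr": [1] * 10 + [2] * 10, "date": list(dates) * 2})
toy["Condition"] = (toy["station_nbr"] == 1) & (toy["date"] == pd.Timestamp("2012-01-05"))
toy["in_window"] = mark_event_window(toy)

print(toy.loc[toy["in_window"], ["station_nbr", "date"]])
```

Matching on the station as well as the date keeps an event at one station from marking days at another.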

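For data interpolation, pandas provides a convenient method: DataFrame.interpolate(). It returns a new object rather than modifying the data in place, so the result has to be assigned back.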
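A minimal sketch of this behaviour on a made-up series, using the default linear method:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

s.interpolate()           # result is discarded; s still contains NaN
filled = s.interpolate()  # linear interpolation, kept under a new name

print(int(s.isna().sum()), filled.tolist())
```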

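In [None]:
# Work on a copy so that df_main_train keeps the raw values.
df7 = df_main_train.copy()

# The weather readings are stored as text; unparseable entries become NaN.
weather_cols = ['tmax','tmin','tavg','depart','dewpoint','wetbulb','heat','cool',
                'snowfall','preciptotal','stnpressure','sealevel',
                'resultspeed','resultdir','avgspeed']
df7[weather_cols] = df7[weather_cols].apply(pd.to_numeric, errors='coerce')
# interpolate() returns a new object, so assign the result back.
df7[weather_cols] = df7[weather_cols].interpolate()

# na=False treats a missing codesum as "no rain or snow code".
patternRA = 'RA'
patternSN = 'SN'
df7['RA'] = df7['codesum'].str.contains(patternRA, na=False)
df7['SN'] = df7['codesum'].str.contains(patternSN, na=False)
# Event day: rain code with more than 1 inch of precipitation,
# or snow code with more than 2 inches of snowfall.
df7['Condition'] = (df7['RA'] & (df7['preciptotal']>1.0)) | (df7['SN'] & (df7['snowfall']>2.0))

df8 = df7.loc[df7['Condition']]

print(df8.shape)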

## Outlier Detection

Looking for outliers, defined as values more than 3 standard deviations away from the mean.

In [None]:
df9 = df8[['date','preciptotal']]

# Keep rows whose precipitation is more than 3 standard deviations from the mean.
df10 = df9[np.abs(df9.preciptotal-df9.preciptotal.mean())>(3*df9.preciptotal.std())]

# Print each distinct outlier value once.
grouped_df = df10.groupby('preciptotal')['date']

for key, item in grouped_df:
    print(key)

For the most important variable, the 7.36-inch rainfall value seems to be OK?

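In [None]:
# .copy() avoids a SettingWithCopyWarning on the assignment below.
df9 = df8[['date','tavg']].copy()

# pd.to_numeric replaces the deprecated convert_objects;
# the interpolated result is assigned back instead of being discarded.
df9['tavg'] = pd.to_numeric(df9['tavg'], errors='coerce').interpolate()

df10 = df9[np.abs(df9.tavg-df9.tavg.mean())>(3*df9.tavg.std())]

grouped_df = df10.groupby('tavg')['date']

for key, item in grouped_df:
    print(key)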

-4 degrees is the coldest case; as someone who has lived in northern Canada, I envy those guys.

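In [None]:
# .copy() avoids a SettingWithCopyWarning on the assignment below.
df9 = df8[['date','avgspeed']].copy()

# pd.to_numeric replaces the deprecated convert_objects;
# the interpolated result is assigned back instead of being discarded.
df9['avgspeed'] = pd.to_numeric(df9['avgspeed'], errors='coerce').interpolate()

df10 = df9[np.abs(df9.avgspeed-df9.avgspeed.mean())>(3*df9.avgspeed.std())]

grouped_df = df10.groupby('avgspeed')['date']

for key, item in grouped_df:
    print(key)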

No outliers were found in the average wind speed.
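The three cells above apply the same rule to different columns, so the rule could be wrapped in one function. A minimal sketch follows; `three_sigma_outliers` is a hypothetical helper name and the sample values are made up.

```python
import pandas as pd

def three_sigma_outliers(series, n_std=3.0):
    """Return the values lying more than n_std standard deviations from the mean."""
    values = pd.to_numeric(series, errors="coerce")
    deviation = (values - values.mean()).abs()
    return values[deviation > n_std * values.std()]

# Made-up sample: nineteen ordinary readings and one extreme one.
sample = pd.Series([0.5] * 19 + [7.36], name="preciptotal")
outliers = three_sigma_outliers(sample)

print(outliers.tolist())
```

With a helper like this, each check reduces to one call per column, for example on `df8['tavg']` or `df8['avgspeed']`.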

## Data Selection: VIF

<font color='red'>Problem: I am not sure how to use it logically, so I made three groups based on temperature, wind, and rainfall/snowfall.</font>

In [None]:
import pandas as pd
import numpy as np
from patsy import dmatrices
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Restrict to item 11 on weather-event days.
df11 = df8

mask = (df11['item_nbr'] == 11)
df11 = df11.loc[mask]

df12 = df11[['units','tmax','tmin','tavg','depart','dewpoint','wetbulb','heat','cool']]
# pd.to_numeric replaces the deprecated convert_objects; rows that still contain NaN are dropped.
df12 = df12.apply(pd.to_numeric, errors='coerce').dropna()
# reset_index returns a new frame, so the result is assigned back.
df12 = df12.reset_index(drop=True)




In [None]:
df13 = df12[['tmax','tmin','tavg','depart','dewpoint','wetbulb','heat','cool']]

features = "+".join(df13.columns)
y, X = dmatrices('units ~' + features, df12, return_type='dataframe')

vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

vif.round(1)

In [None]:

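df11 = df8

mask = (df11['item_nbr'] == 11)
df11 = df11.loc[mask]

df12 = df11[['units','snowfall','preciptotal']]
# pd.to_numeric replaces the deprecated convert_objects; rows that still contain NaN are dropped.
df12 = df12.apply(pd.to_numeric, errors='coerce').dropna()
# reset_index returns a new frame, so the result is assigned back.
df12 = df12.reset_index(drop=True)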

df13 = df12[['snowfall','preciptotal']]

features = "+".join(df13.columns)
y, X = dmatrices('units ~' + features, df12, return_type='dataframe')

vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

vif.round(1)

In [None]:

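df11 = df8

mask = (df11['item_nbr'] == 11)
df11 = df11.loc[mask]

df12 = df11[['units','stnpressure','sealevel','resultspeed','resultdir','avgspeed']]
# pd.to_numeric replaces the deprecated convert_objects; rows that still contain NaN are dropped.
df12 = df12.apply(pd.to_numeric, errors='coerce').dropna()
# reset_index returns a new frame, so the result is assigned back.
df12 = df12.reset_index(drop=True)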

df13 = df12[['stnpressure','sealevel','resultspeed','resultdir','avgspeed']]

features = "+".join(df13.columns)
y, X = dmatrices('units ~' + features, df12, return_type='dataframe')

vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

vif.round(1)

<font color='red'>Problem: How do I remove features based on this result?</font>
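A common rule of thumb is to drop the feature with the highest VIF, recompute, and repeat until every remaining VIF is below a threshold (10 is often used, sometimes 5). The sketch below illustrates that loop on made-up data. It computes the VIF directly from its definition, 1 / (1 - R²), so that it depends only on numpy and pandas; statsmodels' `variance_inflation_factor` could be substituted. `vif` and `drop_high_vif` are hypothetical helper names, and the threshold is an assumption rather than a fixed rule.

```python
import numpy as np
import pandas as pd

def vif(df, col):
    """VIF of `col`: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    y = df[col].to_numpy(dtype=float)
    others = df.drop(columns=col).to_numpy(dtype=float)
    X = np.column_stack([np.ones(len(df)), others])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    return np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)

def drop_high_vif(df, threshold=10.0):
    """Repeatedly drop the column with the largest VIF until all are at or below threshold."""
    kept = df.copy()
    while kept.shape[1] > 1:
        scores = pd.Series({c: vif(kept, c) for c in kept.columns})
        if scores.max() <= threshold:
            break
        kept = kept.drop(columns=scores.idxmax())
    return kept

# Made-up predictors: tavg is almost exactly the mean of tmax and tmin.
rng = np.random.default_rng(0)
n = 200
tmin = rng.normal(40, 10, n)
tmax = tmin + rng.normal(20, 2, n)
toy = pd.DataFrame({
    "tmax": tmax,
    "tmin": tmin,
    "tavg": (tmax + tmin) / 2 + rng.normal(0, 0.1, n),
    "avgspeed": rng.normal(8, 3, n),
})

kept = drop_high_vif(toy, threshold=10.0)
print(list(kept.columns))
```

Only the predictors go into the function; the target column units is left out, because VIF measures collinearity among the explanatory variables.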