In [None]:
### Project Problem and Hypothesis
* What's the project about? What problem are you solving?
* Where does this seem to reside as a machine learning problem? Are you predicting some continuous number, or predicting a binary value?
* What kind of impact do you think it could have?
* What do you think will have the most impact in predicting the value you are interested in solving for?

### Datasets
* Description of data set available, at the field level (see table)
* If from an API, include a sample return (this is usually included in API documentation!) (if doing this in markdown, use the javascript code tag)

### Domain knowledge
* What experience do you already have around this area?
* Does it relate or help inform the project in any way?
* What other research efforts exist?
    * Use a quick Google search to see what approaches others have made, or talk with your colleagues if it is work related about previous attempts at similar problems.
    * This could even just be something like "the marketing team put together a forecast in Excel that doesn't do well."
    * Include a benchmark, how other models have performed, even if you are unsure what the metric means.

### Project Concerns
* What questions do you have about your project? What do you not yet fully understand? (The more honest you are about this, the more easily your instructors can help.)
* What are the assumptions and caveats to the problem?
    * What data do you not have access to but wish you had?
    * What is already implied about the observations in your data set? For example, if your primary data set is Twitter data, it may not be representative of the whole population (say, when predicting who would win an election).
* What are the risks to the project?
    * What's the cost of your model being wrong? (What's the benefit of your model being right?)
    * Is any of the data incorrect? Could it be incorrect?

### Outcomes
* What do you expect the output to look like?
* What does your target audience expect the output to look like?
* What gain do you expect from your most important feature on its own?
* How complicated does your model have to be?
* How successful does your project have to be in order to be considered a "success"?
* What will you do if the project is a bust (this happens! but it shouldn't here)?


In [None]:
### Project Problem and Hypothesis
The main objective of the Big Mart Sales Prediction project is to build a model 
that accurately predicts sales for each store in the dataset at the product level. 
There are 12 variables provided in the data set and 1559 products across 10 stores in different cities. I believe
the data set was part of a competition that was hosted in 2016, so there are multiple resources online. Big Mart is a retailer in the food and staples industry with a variety of products sold in 13 stores today. 

Data Available: The Big Mart 2013 Sales Dataset includes ~8,500 rows of data with 12 variables. This set covers 1559 products across 10 stores in different cities. 

Objective: Use the dataset to create a model that predicts the sales of different products at each of the stores. This will help Big Mart management determine the best ways to increase sales at the various locations. 


In [None]:
### Datasets
# The datasets include 8,523 rows of data to train the predictive model and
# 5,681 rows of data to test the model.
# I will combine the data sets in a later cell.
# https://www.analyticsvidhya.com/?s=big+mart
    

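Item_Identifier - unique ID for each product
Item_Weight - weight of the product
Item_Fat_Content - binary variable: low fat or regular (labels are inconsistent in the raw data, e.g. "reg" and "Regular")
Item_Visibility - visibility of the product within the whole store, expressed as a % (stored as a fraction; observed range 0 to about 0.33)
Item_Type - product category (e.g. Dairy, Snack Foods, Household)
Item_MRP - maximum retail price (MRP) of the product
Outlet_Identifier - store identifier
Outlet_Establishment_Year - year the store was founded
Outlet_Size - size category of the store (e.g. Small, Medium)
Outlet_Location_Type - type of city the store is in (Tier 1, Tier 2, Tier 3)
Outlet_Type - grocery store or supermarket type (e.g. Supermarket Type1, Supermarket Type3)
Item_Outlet_Sales - target variable; null for the test data.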



In [19]:
import pandas as pd
import numpy as np


#Read files:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

train.describe()



Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,7060.0,8523.0,8523.0,8523.0,8523.0
mean,12.857645,0.066132,140.992782,1997.831867,2181.288914
std,4.643456,0.051598,62.275067,8.37176,1706.499616
min,4.555,0.0,31.29,1985.0,33.29
25%,8.77375,0.026989,93.8265,1987.0,834.2474
50%,12.6,0.053931,143.0128,1999.0,1794.331
75%,16.85,0.094585,185.6437,2004.0,3101.2964
max,21.35,0.328391,266.8884,2009.0,13086.9648


In [20]:
test.describe()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year
count,4705.0,5681.0,5681.0,5681.0
mean,12.695633,0.065684,141.023273,1997.828903
std,4.664849,0.051252,61.809091,8.372256
min,4.555,0.0,31.99,1985.0
25%,8.645,0.027047,94.412,1987.0
50%,12.5,0.054154,141.4154,1999.0
75%,16.7,0.093463,186.0266,2004.0
max,21.35,0.323637,266.5884,2009.0


In [23]:
test.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,source
0,FDW58,20.75,Low Fat,0.007565,Snack Foods,107.8622,OUT049,1999,Medium,Tier 1,Supermarket Type1,test
1,FDW14,8.3,reg,0.038428,Dairy,87.3198,OUT017,2007,,Tier 2,Supermarket Type1,test
2,NCN55,14.6,Low Fat,0.099575,Others,241.7538,OUT010,1998,,Tier 3,Grocery Store,test
3,FDQ58,7.315,Low Fat,0.015388,Snack Foods,155.034,OUT017,2007,,Tier 2,Supermarket Type1,test
4,FDY38,,Regular,0.118599,Dairy,234.23,OUT027,1985,Medium,Tier 3,Supermarket Type3,test
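
The `test.head()` output above shows `Item_Fat_Content` recorded as both `reg` and `Regular`, so the labels probably need normalizing before modeling. Below is a minimal sketch of one way to do that. The toy frame only mirrors the five rows shown above, and `fat_label_map` is a hypothetical mapping that covers just the one variant visible here; the full column may contain other spellings, so check `value_counts()` first.

```python
import pandas as pd

# Toy frame standing in for the real data; values mirror the test.head() output.
sample = pd.DataFrame({
    "Item_Identifier": ["FDW58", "FDW14", "NCN55", "FDQ58", "FDY38"],
    "Item_Fat_Content": ["Low Fat", "reg", "Low Fat", "Low Fat", "Regular"],
})

# Hypothetical mapping: only "reg" is visible in the output above.
# Extend it after inspecting value_counts() on the full column.
fat_label_map = {"reg": "Regular"}

# Replace the variant spellings with the canonical label.
sample["Item_Fat_Content"] = sample["Item_Fat_Content"].replace(fat_label_map)

# Count how many rows fall under each label after cleaning.
label_counts = sample["Item_Fat_Content"].value_counts()
print(label_counts)
```

The same `replace` call should work on the combined `data` frame once the mapping covers every variant found there.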


In [21]:
train['source']='train'
test['source']='test'
data = pd.concat([train, test],ignore_index=True)
print(train.shape, test.shape, data.shape)

(8523, 13) (5681, 12) (14204, 13)
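
Because the `source` column records where each row came from, the combined frame can be split back into train and test sets after any shared cleaning. The sketch below illustrates the round trip on toy frames; `toy_train`, `toy_test`, and their values are made up for illustration and are not the real data.

```python
import pandas as pd

# Toy frames standing in for train.csv and test.csv (made-up values).
toy_train = pd.DataFrame({
    "Item_MRP": [250.0, 48.0, 141.0],
    "Item_Outlet_Sales": [3700.0, 440.0, 2100.0],
})
toy_test = pd.DataFrame({"Item_MRP": [108.0, 87.0]})

# Tag each frame with its origin, then stack them as in the cell above.
toy_train["source"] = "train"
toy_test["source"] = "test"
combined = pd.concat([toy_train, toy_test], ignore_index=True)

# ... shared cleaning / feature engineering would go here ...

# Split back using the source tag; the test part has no target column.
train_part = combined[combined["source"] == "train"].drop(columns="source")
test_part = combined[combined["source"] == "test"].drop(
    columns=["source", "Item_Outlet_Sales"]
)

print(train_part.shape, test_part.shape)
```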


In [28]:
#Identify null data.
#The number of null Item_Outlet_Sales values (5,681) equals the number of test rows, which makes sense because that is what the model is meant to predict.
#Item_Weight and Outlet_Size also have missing values that will need to be handled.
data.isnull().sum()



Item_Fat_Content                0
Item_Identifier                 0
Item_MRP                        0
Item_Outlet_Sales            5681
Item_Type                       0
Item_Visibility                 0
Item_Weight                  2439
Outlet_Establishment_Year       0
Outlet_Identifier               0
Outlet_Location_Type            0
Outlet_Size                  4016
Outlet_Type                     0
source                          0
dtype: int64
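
To confirm that the missing `Item_Outlet_Sales` values all come from the test rows, and to see how the other gaps are split, the null counts can be broken down by `source`. This is a sketch on a toy frame with made-up values, not the real data; on the notebook's `data` frame the same expression should apply.

```python
import numpy as np
import pandas as pd

# Toy combined frame (made-up values) with the same kinds of gaps as the real data.
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 20.75, np.nan],
    "Outlet_Size": ["Medium", "Medium", np.nan, "Medium", np.nan],
    "Item_Outlet_Sales": [3700.0, 440.0, 2100.0, np.nan, np.nan],
    "source": ["train", "train", "train", "test", "test"],
})

# Flag nulls per cell, then sum the flags within each source group.
null_by_source = (
    toy.drop(columns="source").isnull().groupby(toy["source"]).sum()
)
print(null_by_source)
```

If the target is only missing in the test rows, the `train` row of the `Item_Outlet_Sales` column will be zero.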

In [29]:
train[pd.isnull(train).any(axis=1)]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,source
3,FDX07,19.200,Regular,0.000000,Fruits and Vegetables,182.0950,OUT010,1998,,Tier 3,Grocery Store,732.3800,train
7,FDP10,,Low Fat,0.127470,Snack Foods,107.7622,OUT027,1985,Medium,Tier 3,Supermarket Type3,4022.7636,train
8,FDH17,16.200,Regular,0.016687,Frozen Foods,96.9726,OUT045,2002,,Tier 2,Supermarket Type1,1076.5986,train
9,FDU28,19.200,Regular,0.094450,Frozen Foods,187.8214,OUT017,2007,,Tier 2,Supermarket Type1,4710.5350,train
18,DRI11,,Low Fat,0.034238,Hard Drinks,113.2834,OUT027,1985,Medium,Tier 3,Supermarket Type3,2303.6680,train
21,FDW12,,Regular,0.035400,Baking Goods,144.5444,OUT027,1985,Medium,Tier 3,Supermarket Type3,4064.0432,train
23,FDC37,,Low Fat,0.057557,Baking Goods,107.6938,OUT019,1985,Small,Tier 1,Grocery Store,214.3876,train
25,NCD06,13.000,Low Fat,0.099887,Household,45.9060,OUT017,2007,,Tier 2,Supermarket Type1,838.9080,train
28,FDE51,5.925,Regular,0.161467,Dairy,45.5086,OUT010,1998,,Tier 3,Grocery Store,178.4344,train
29,FDC14,,Regular,0.072222,Canned,43.6454,OUT019,1985,Small,Tier 1,Grocery Store,125.8362,train


In [None]:
### Domain Knowledge
    # I don't have any experience with this type of modeling, but there are a lot of resources on the internet.
    # There are plenty of existing models online, with varying levels of performance.
    # There are also online scoring metrics to benchmark against.
    

In [None]:
### Project Concerns
    
    # I'm not too concerned about this project because of the number of online resources and examples related to it.
    # I chose the project primarily to reduce how much help I will need from my teachers. It builds on many of the skills we learned in the first few classes,
    # so there shouldn't be much to research that we haven't already covered.
    # The cost of the model being wrong would be an incorrect sales forecast, which could hurt inventory management and lead to lost profits.

In [None]:
### Outcomes
                                             
# The project output will be a model that predicts the sales for each outlet store by product.
# We were provided data to train the model and data to test it. The model should not need to be too complex because there are only 12 variables.
# I don't expect the project to be a bust because others have had success with this dataset. I will do more research and seek out guidance if I am struggling.
# The success of the model will be judged by R-squared and mean squared error.
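
For reference, the two success metrics named above can be computed directly. With actual sales $y_i$, predictions $\hat{y}_i$, and mean actual sales $\bar{y}$, mean squared error is $\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ and R-squared is $1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. The sketch below uses made-up numbers, and the helper names are illustrative rather than taken from any library.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up sales figures, for illustration only.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(mse, r2)
```

A model that always predicts the mean of the actual values scores an R-squared of 0, which gives a simple baseline to beat.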