# Semi-Supervised Learning
The code is adapted from [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2017/09/pseudo-labelling-semi-supervised-learning-technique/), and the data comes from their data hack challenge. Further details are available [here](https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/#ProblemStatement).

The problem statement for this dataset and exercise is:

> Predicting sales for Big Mart outlets

## Background
The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and predict the sales of each product at a particular outlet.

Using this model, BigMart will try to understand the properties of products and outlets which play a key role in increasing sales.

In [39]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# display multiple outputs in same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [47]:
df_train = pd.read_csv(filepath_or_buffer = '../data/data_mart_train.csv')
df_test = pd.read_csv(filepath_or_buffer = '../data/data_mart_test.csv')

df_train['dataset_identifier'] = 'train'
df_test['dataset_identifier'] = 'test'

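# the test set has no target; np.nan is used because the np.NaN alias was removed in NumPy 2.0
df_test['Item_Outlet_Sales'] = np.nan

# concatenate the dataframes so the same preprocessing is applied to both
df = pd.concat([df_train, df_test])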

df.sample(n = 5, random_state = 42)

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales | dataset_identifier |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5567 | FDT51 | NaN | Regular | 0.010866 | Meat | 111.4544 | OUT027 | 1985 | Medium | Tier 3 | Supermarket Type3 | NaN | test |
| 4098 | FDB27 | 7.575 | Low Fat | 0.055476 | Dairy | 196.8768 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 1182.4608 | train |
| 5406 | NCB07 | 19.2 | Low Fat | 0.077493 | Household | 197.011 | OUT035 | 2004 | Small | Tier 2 | Supermarket Type1 | NaN | test |
| 7562 | FDY39 | NaN | Regular | 0.0 | Meat | 182.0608 | OUT027 | 1985 | Medium | Tier 3 | Supermarket Type3 | 7717.9536 | train |
| 3666 | DRA59 | 8.27 | Regular | 0.128187 | Soft Drinks | 184.3924 | OUT045 | 2002 | NaN | Tier 2 | Supermarket Type1 | NaN | test |
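
The tag, concatenate, and split-back pattern used in the loading cell can be sketched on two tiny frames. The frames and their values below are made up for illustration and are not rows from the Big Mart data. `ignore_index=True` is an optional addition, not part of the notebook's call; it avoids the duplicated row labels that a plain `pd.concat` keeps.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test files (values are made up)
toy_train = pd.DataFrame({'Item_MRP': [100.0, 150.0],
                          'Item_Outlet_Sales': [900.0, 1200.0]})
toy_test = pd.DataFrame({'Item_MRP': [120.0]})

# Tag each frame so the rows can be separated again later
toy_train['dataset_identifier'] = 'train'
toy_test['dataset_identifier'] = 'test'

# The test set has no target, so fill it with NaN to align the columns
toy_test['Item_Outlet_Sales'] = np.nan

# ignore_index=True gives the combined frame a fresh 0..n-1 index
toy = pd.concat([toy_train, toy_test], ignore_index=True)

# Shared preprocessing would go here; afterwards the frames are split back apart
train_back = toy[toy['dataset_identifier'] == 'train']
test_back = toy[toy['dataset_identifier'] == 'test']
```

Without `ignore_index=True` the combined frame keeps each source frame's own row labels, which is why the sampled output above can show a label that exists in both the train and the test rows.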


## Preprocessing
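We preprocess the combined data to make it suitable for modelling: missing values are imputed, inconsistent fat-content labels are merged, the establishment year is converted to outlet age, categorical variables are label-encoded, and unused columns are dropped before the train and test sets are separated again.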

In [48]:
# lowercase headings
df.columns = df.columns.str.lower()

# impute missing values: column mean for item_weight, the fixed category 'Small' for outlet_size
df['item_weight'] = df['item_weight'].fillna(value = df['item_weight'].mean())
df['outlet_size'] = df['outlet_size'].fillna(value = 'Small')

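# reduce fat content to only two categories
# ('low fat' rather than 'lower fat': the raw column contains the variants 'low fat', 'LF' and 'reg')
df['item_fat_content'] = df['item_fat_content'].replace(to_replace = ['low fat', 'LF', 'reg'], value = ['Low Fat', 'Low Fat', 'Regular'])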

# convert establishment year to outlet age in years, as of the 2013 sales data
df['outlet_establishment_year'] = 2013 - df['outlet_establishment_year']

# apply label encoding for categorical variables
df_adjust = df.copy()
col = ['item_fat_content', 'outlet_size', 'outlet_location_type', 'outlet_type']
number = LabelEncoder()
for i in col:
    df_adjust[i] = number.fit_transform(df_adjust[i].astype('str'))
    df_adjust[i] = df_adjust[i].astype('int')


# drop the identifier columns and item_type, which are not used as features
df_adjust = df_adjust.drop(labels = ['outlet_identifier', 'item_type', 'item_identifier'], axis = 1)

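# define input/features and output/target
# (the target and the train/test marker are excluded from the feature list)
model_target = 'item_outlet_sales'
model_features = df_adjust.columns.drop([model_target, 'dataset_identifier'])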

# establish train and test sets
X_train = df_adjust.query('dataset_identifier == "train"')
X_train = X_train.drop(labels = ['item_outlet_sales', 'dataset_identifier'], axis = 1)
X_test = df_adjust.query('dataset_identifier == "test"')
X_test = X_test.drop(labels = ['item_outlet_sales', 'dataset_identifier'], axis = 1)
y_train = df_adjust.query('dataset_identifier == "train"')
y_train = y_train['item_outlet_sales']
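
The imputation, label clean-up, and encoding steps in this cell can be sketched on a small made-up frame. The rows below are illustrative assumptions, not Big Mart data, and keeping one fitted encoder per column in an `encoders` dictionary is a variation on the notebook's loop, not what the loop itself does.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows that mimic the kinds of gaps and label variants handled above
toy = pd.DataFrame({
    'item_weight': [10.0, np.nan, 14.0],
    'outlet_size': ['Medium', np.nan, 'High'],
    'item_fat_content': ['LF', 'Regular', 'reg'],
})

# Numeric gap: replace with the column mean (the mean of 10 and 14 is 12)
toy['item_weight'] = toy['item_weight'].fillna(toy['item_weight'].mean())

# Categorical gap: replace with a fixed category
toy['outlet_size'] = toy['outlet_size'].fillna('Small')

# Collapse spelling variants into two categories
toy['item_fat_content'] = toy['item_fat_content'].replace(
    {'LF': 'Low Fat', 'reg': 'Regular'})

# One encoder per column, kept so integer codes can be mapped back to labels
encoders = {}
for column in ['outlet_size', 'item_fat_content']:
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column].astype(str))

# LabelEncoder assigns codes in sorted order of the labels
size_labels = list(encoders['outlet_size'].classes_)
```

The notebook's loop reuses a single `LabelEncoder`, which is refitted on every column, so after the loop it only holds the mapping for the last column. Keeping the fitted encoders, as sketched here, makes it possible to call `inverse_transform` on any encoded column later.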

In [None]:
df.sample(n = 5, random_state = 42)
df_adjust.sample(n = 5, random_state = 42)

In [58]:
y_train.dtypes

dtype('float64')

Store the objects created above with IPython's `%store` magic so that the next notebook session can restore them with `%store -r object_keep`.

In [59]:
object_keep = {'df':df,
               'df_adjust': df_adjust, 
               'model_features': model_features, 
               'model_target': model_target, 
               'X_train': X_train, 
               'X_test': X_test, 
               'y_train': y_train}
%store object_keep

Stored 'object_keep' (dict)
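
In a follow-up notebook, `%store -r object_keep` brings the dictionary back into the session, after which its entries can be unpacked into variables. A minimal sketch of that unpacking step follows; the `unpack` helper and the stand-in dictionary are hypothetical and exist only so the snippet runs outside IPython.

```python
def unpack(stored, keys):
    """Return the requested entries in order, failing loudly on a missing key."""
    missing = [key for key in keys if key not in stored]
    if missing:
        raise KeyError(f"missing stored objects: {missing}")
    return tuple(stored[key] for key in keys)

# Stand-in for the restored dictionary (made-up values); in a real session
# it would come from running `%store -r object_keep` in IPython first
object_keep = {
    'model_target': 'item_outlet_sales',
    'X_train': [[1.0], [2.0]],
    'y_train': [10.0, 20.0],
}

# Pull out only the objects the next step needs
X_train, y_train, model_target = unpack(
    object_keep, ['X_train', 'y_train', 'model_target'])
```

Checking for missing keys up front gives a clear error if the stored dictionary came from an older run that did not contain every object.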
