# The Lego Collector's Dilemma  

## Problem statement

You are a die-hard Lego enthusiast who wants to collect as many Lego sets as you can. Before buying, you want to be able to predict the price of a new Lego product before it is revealed, so that you can budget for it. Since (luckily!) you are a data scientist in the making, you want to solve this problem yourself. This dataset contains information on Lego sets scraped from lego.com. Each observation is a different Lego set with features such as the number of pieces in the set, the rating for the set, and the number of reviews per set. Your aim is to build a linear regression model to predict the price of a set.


## About the Dataset:
A snapshot of the data you will be working with:

![Dataset](../images/lego_data.PNG)

You can see that some of the features in the data, such as `review_difficulty`, `theme_name` and `Country Name`, are textual in nature. Don't worry, we have made things simple for you with some behind-the-scenes data preprocessing. We have also modified the `age` feature. You will learn about all these preprocessing techniques in a later concept. For now, let us concentrate on getting those Lego sets in your hands soon. :)

![Dataset](../images/new_le.png)



## Why solve this project?

After completing this project, you will have a better understanding of how to build a linear regression model. In this project, you will apply the following concepts:

 
- Train-test split
- Correlation between the features 
- Linear Regression
- MSE and $R^2$ Evaluation Metrics

## Data loading and splitting (A complete task)

In this task, we will load the data and take a look at the features and target variable.

We will also split the data into train set and test set for further processing.


## Instructions

* The path to the dataset file has been stored in the variable `path`.
* Load the dataset into a variable `df` using the pandas `read_csv()` function.
* Display the first 5 rows of the dataframe `df`.
* Store all the features (independent variables) in a variable called `X`.
* Store the target variable (dependent variable) in a variable called `y`.
* Split the data into `X_train, X_test, y_train, y_test` using the `train_test_split()` function. Use `test_size = 0.3` and `random_state = 6`.


## Hints

* Use `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=6)` to split the data.


In [1]:
# Import header files

import warnings
warnings.filterwarnings('ignore')
import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

# Read the dataset
df = pd.read_csv("../data/lego_final.csv")

# Print first five rows
print(df.head())

# Store independent variable
X = df.drop('list_price',axis=1)

# Store dependent variable
y = df['list_price']

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=6)

# code ends here


   ages  list_price  num_reviews  piece_count  play_star_rating  \
0    19       29.99            2          277               4.0   
1    19       19.99            2          168               4.0   
2    19       12.99           11           74               4.3   
3     5       99.99           23         1032               3.6   
4     5       79.99           14          744               3.2   

   review_difficulty  star_rating  theme_name  val_star_rating  country  
0                  0          4.5           0              4.0       20  
1                  2          5.0           0              4.0       20  
2                  2          4.3           0              4.1       20  
3                  0          4.6           1              4.3       20  
4                  1          4.6           1              4.1       20  




# Predictor Check 

In this task, we will draw a scatter plot of each feature against the target variable.

This will help us identify which features are highly correlated with the target variable. 
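
One possible way to draw these plots is sketched below. It assumes `matplotlib` is available; the helper `plot_features_vs_target` and the variables `X_demo` and `y_demo` are hypothetical names introduced for illustration, and the demo values are just the five rows printed in the data-loading output above. In the project itself you would likely pass `X_train` and `y_train` instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line is not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for X_train / y_train: the five rows printed by df.head()
X_demo = pd.DataFrame({
    "piece_count": [277, 168, 74, 1032, 744],
    "num_reviews": [2, 2, 11, 23, 14],
    "star_rating": [4.5, 5.0, 4.3, 4.6, 4.6],
    "play_star_rating": [4.0, 4.0, 4.3, 3.6, 3.2],
})
y_demo = pd.Series([29.99, 19.99, 12.99, 99.99, 79.99], name="list_price")


def plot_features_vs_target(X, y, n_cols=2):
    """Draw one scatter plot per column of X against y and return the figure."""
    n_rows = -(-X.shape[1] // n_cols)  # ceiling division
    fig, axes = plt.subplots(
        n_rows, n_cols, figsize=(5 * n_cols, 4 * n_rows), squeeze=False
    )
    flat_axes = axes.ravel()
    for ax, col in zip(flat_axes, X.columns):
        ax.scatter(X[col], y)
        ax.set_xlabel(col)
        ax.set_ylabel(y.name)
    # Hide any panels left over when the grid has more cells than features
    for ax in flat_axes[X.shape[1]:]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig


fig = plot_features_vs_target(X_demo, y_demo)
```

A feature whose points fall roughly along a line is a candidate for a good linear predictor; a shapeless cloud suggests a weak linear relationship.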
   

### Things to ponder

- Which of these features would be the best predictor for estimating our target variable?


# Reduce feature redundancies! 

In this task, we will look for correlation among the features, using an inter-feature correlation threshold of `0.75`. If the absolute value of the correlation between two features is greater than `0.75`, we will remove one of them.
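
The thresholding step could be sketched as follows. This is a minimal illustration rather than the project's reference solution: the helper `drop_correlated_features` and the toy columns `feat_a`, `feat_b`, `feat_c` are hypothetical, and the rule for which feature of a pair to drop (here, the one that appears later in the column order) is an assumption.

```python
import numpy as np
import pandas as pd


def drop_correlated_features(X, threshold=0.75):
    """Drop every column whose absolute correlation with a column
    that appears before it exceeds `threshold`.

    Returns the reduced dataframe and the list of dropped column names.
    """
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is examined once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop), to_drop


# Toy data (made up for illustration): feat_b is almost a multiple of feat_a
X_toy = pd.DataFrame({
    "feat_a": [1, 2, 3, 4, 5],
    "feat_b": [2, 4, 6, 8, 11],
    "feat_c": [2, -1, 0, 1, -2],
})

X_reduced, dropped = drop_correlated_features(X_toy, threshold=0.75)
print(dropped)
print(list(X_reduced.columns))
```

A common alternative is to compare each correlated pair against the target and keep whichever feature correlates more strongly with `list_price`. To avoid leaking information from the test set, compute the correlations on the training data and drop the same columns from both `X_train` and `X_test`.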



# Is my price prediction ok?


In this task, we will use linear regression to predict the price.
We will load the linear regression model from `sklearn`, train it on the training data, and predict the outcomes for the test data.

We will then evaluate the model using the $R^2$ score and the mean squared error (MSE).
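
A minimal sketch of these steps is shown below. To stay self-contained it runs on synthetic data, which is an assumption for illustration only: the variables `piece_count` and `price` are generated here and are not taken from the Lego dataset. In the project you would fit on your own `X_train` and `y_train`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: price grows linearly with piece count, plus noise
rng = np.random.default_rng(6)
piece_count = rng.integers(50, 1500, size=200)
price = 0.1 * piece_count + 5 + rng.normal(0, 3, size=200)

# scikit-learn expects a 2-D feature matrix, so reshape the single feature
X_demo = piece_count.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, price, test_size=0.3, random_state=6
)

# Fit on the training split only, then predict the held-out test split
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Evaluate the predictions on the test split
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}, R^2: {r2:.3f}")
```

MSE is the average squared difference between the actual and predicted prices, so lower is better and its units are the square of the price units. $R^2$ is one minus the ratio of the residual sum of squares to the total sum of squares; a value close to 1 means the model explains most of the variance in the test prices.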
