In this tutorial we are going to predict the number of points a wine receives, based on the input features. I will try to show the whole pipeline for solving this type of problem.

Let's define our problem in simple math terms.

* x will be our input
* y will be the value of f(x), in our case the number of points.

<center><math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <mi>f</mi>
  <mo>:</mo>
  <mi>x</mi>
  <mo stretchy="false">&#x2192;<!-- → --></mo>
  <mi>y</mi>
</math></center>

Machine learning problems are typically divided into two groups, depending on the type of y.

1. **Regression** - if y is a real number or a continuous variable
 * Regression predictions can be evaluated using root mean squared error, whereas classification predictions cannot.
2. **Classification** - if y is a discrete or categorical variable
 * Classification predictions can be evaluated using accuracy, whereas regression predictions cannot.

We can easily figure out that our problem is a regression problem, because we want to predict the number of points, which we treat as a continuous variable.
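
To make the difference between the two metrics concrete, here is a minimal sketch in plain Python. The helper names `rmse` and `accuracy` and all the values are made up for illustration; they are not part of the dataset or of any library.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error for real-valued targets."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def accuracy(y_true, y_pred):
    """Fraction of exactly matching labels for categorical targets."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)

# Hypothetical point predictions (regression)
points_true = [88, 90, 85, 92]
points_pred = [87, 91, 85, 90]
print(rmse(points_true, points_pred))

# Hypothetical wine colour labels (classification)
colour_true = ["red", "white", "red", "rose"]
colour_pred = ["red", "white", "white", "rose"]
print(accuracy(colour_true, colour_pred))
```

RMSE rewards predictions that are close to the target, while accuracy only counts exact matches, which is why it is a poor fit for a continuous target.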

There are a few ways to solve this problem. The most common are:

1. **Neural networks**
2. Ensembles of decision trees built with bagging or boosting, e.g. **Random forest**

> * Bagging (Bootstrap Aggregation) is used when our goal is to reduce the variance of a decision tree. Here idea is to create several subsets of data from training sample chosen randomly with replacement. Now, each collection of subset data is used to train their decision trees. As a result, we end up with an ensemble of different models. Average of all the predictions from different trees are used which is more robust than a single decision tree.
> * Random Forest is an extension over bagging. It takes one extra step where in addition to taking the random subset of data, it also takes the random selection of features rather than using all features to grow trees. When you have many random trees. It’s called Random Forest

> **Reference**: https://towardsdatascience.com/decision-tree-ensembles-bagging-and-boosting-266a8ba60fd9
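
The bagging idea quoted above can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: the data is synthetic, and a least-squares line stands in for the decision tree as the base learner. The names `fit_base_model` and `bagging_predict` are hypothetical. A random forest would additionally sample a random subset of features for each tree, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data (made up for illustration)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

def fit_base_model(x_sample, y_sample):
    """Base learner: least-squares line (a stand-in for a decision tree)."""
    slope, intercept = np.polyfit(x_sample, y_sample, deg=1)
    return slope, intercept

def bagging_predict(x_train, y_train, x_new, n_models=25):
    """Train each model on a bootstrap sample and average the predictions."""
    predictions = []
    for _ in range(n_models):
        # Bootstrap: draw len(x_train) indices at random, with replacement
        idx = rng.integers(0, len(x_train), size=len(x_train))
        slope, intercept = fit_base_model(x_train[idx], y_train[idx])
        predictions.append(slope * x_new + intercept)
    # Aggregation: average over all models
    return np.mean(predictions, axis=0)

x_new = np.array([2.0, 5.0, 8.0])
bagged = bagging_predict(x, y, x_new)
print(bagged)
```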

So, what do we have to do first? 

In this kernel I will follow a pattern that is commonly used for solving this type of problem: 
1. Data analysis
2. Data visualisation
3. Feature selection
4. Model training
5. Feature extraction with PCA 

Here is a great visualisation of a typical pipeline:
![logo](https://cdn-images-1.medium.com/max/2000/1*2T5rbjOBGVFdSvtlhCqlNg.png)
A standard machine learning pipeline (source: Practical Machine Learning with Python, Apress/Springer)

# 1. Data analysis
Before we go into more complicated work, first we have to explore our dataset.

Let's have a quick look at our features.

In [None]:
import numpy as np
import pandas as pd
import os

from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error

print(os.listdir("../input"))


In [None]:
data=pd.read_csv('../input/winemag-data-130k-v2.csv')
data.head()

We see that there is a column called 'Unnamed: 0', which contains the ID of each wine. IDs of course can't help us with regression, so we should drop this column. We will also drop the description column, because in this kernel we will not play with NLP, but only after we have used it to find duplicates.

In [None]:
data=data.drop(columns=['Unnamed: 0'])
data=data.reset_index(drop=True)

Now, we want to explore our features in a more statistical way.
We will use the describe method from pandas. 
It will return information about:
* mean
* standard deviation
* minimum value
* maximum value
* 25%, 50%, 75% quantiles

In [None]:
data.describe()

As we can see, price is the only continuous variable in our input. The minimum and maximum values show that the price feature is very widely spread: there is a wine which costs 3300 dollars, but 75 percent of wines are cheaper than 42 dollars.

### **Duplicates.**
A first look at the data shows that there are many duplicates, which we have to drop.

Let's see how many duplicates are in the data.

In [None]:
print("Total number of examples: ", data.shape[0])
print("Number of examples with the same title and description: ", data[data.duplicated(['description','title'])].shape[0])

We can see that there are almost 10k records with the same title and description. We should drop these rows in order to get proper results.

In [None]:
data=data.drop_duplicates(['description','title'])
# The description was only needed to detect duplicates, so drop it now
data=data.drop(columns=['description'])
data=data.reset_index(drop=True)

### Missing values.
Now, we will investigate our dataset in order to see how many missing values there are. 

In [None]:
data.info()

We see that there is a huge number of missing values. Let's see what percentage of each column is missing.

In [None]:
total = data.isnull().sum().sort_values(ascending = False)
percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
missing_data  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data

The most missing values are in the region, designation, taster name and price columns.

I'm most worried about wines with NaN in the price column. We don't want to predict points for wines whose price is undeclared, so we will drop rows with a NaN value in this column.

The usefulness of the other columns will be investigated at the **Feature selection** stage. Maybe the NaN values are meaningful for particular columns.

In [None]:
data=data.dropna(subset=['price'])
data=data.reset_index(drop=True)
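
The two steps above (percentage of missing values per column, then dropping rows without a price) can be checked on a tiny frame. The sketch below uses made-up values and a hypothetical `toy` frame; it relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, mimicking a few columns of the dataset
toy = pd.DataFrame({
    "country": ["France", "Italy", None, "US"],
    "price": [15.0, np.nan, 30.0, 12.0],
    "region_2": [None, None, None, "Napa"],
})

# isnull().mean() is the fraction of missing values per column
missing_percent = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_percent)

# Keep only the rows where 'price' is known
toy_priced = toy.dropna(subset=["price"]).reset_index(drop=True)
print(len(toy_priced))
```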

Let's also take a quick look at the highest priced wines. 

In [None]:
data[(data['price'] > 2200)]

All 3 of the highest priced wines come from France. 

# 2. Data visualization
Remember that at this stage our goal is not only to explore our data in order to get better predictions. We also want to get a better understanding of what is in the data and explore it in a 'normal' way. This kind of approach can be useful if we have to do some feature engineering, where good data understanding can really help to produce better features. 

The most common ways to visualize data are:
* histograms
* box plots
* swarm plots
* joint plots
* heatmaps

Data can be visualized with the **matplotlib** and **seaborn** libraries and with the **built-in plotting methods of pandas dataframes**.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

def pastel_plot(data, x, y):
    plt.figure(figsize = (15,6))
    plt.title('Points histogram - whole dataset')
    sns.set_color_codes("pastel")
    # Plot the frame passed as an argument, not the global df
    sns.barplot(x = x, y=y, data=data)
    plt.show()

In [None]:
temp = data["points"].value_counts()
df = pd.DataFrame({'points': temp.index,
                   'number_of_wines': temp.values
                  })

pastel_plot(df,'points', 'number_of_wines')

We can see that all wines have at least 80 points, and that the points are roughly normally distributed. The most common score is 88 points.

We can also plot the estimated distribution, not only the histogram. We will show it on the price column.

In [None]:
plt.figure(figsize=(20,5))
plt.title("Distribution of price")
# histplot with kde=True replaces the deprecated sns.distplot
ax = sns.histplot(data["price"], kde=True, stat="density")

We see that if we want a better view of the price distribution, we have to either scale the price or drop the tail. 
We will drop the tail, that is the values above 200 dollars. We also want to calculate how many wines are more expensive than 200 dollars. 

In [None]:
plt.figure(figsize=(20,5))
plt.title("Distribution of price")
# histplot with kde=True replaces the deprecated sns.distplot
ax = sns.histplot(data[data["price"]<200]['price'], kde=True, stat="density")

percent=data[data['price']>200].shape[0]/data.shape[0]*100
print("There are :", percent, "% wines more expensive than 200 USD")

As we can see, we left only 0.59 percent of the wines out of the plot, and the shape of the price distribution is now much easier to read. It is closer to a bell shape, although still skewed to the right.
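
The other option mentioned above, scaling the price instead of dropping the tail, is often done with a logarithm. The sketch below is an assumption about how that could look, not something this kernel does; the prices are made up.

```python
import numpy as np
import pandas as pd

# Made-up prices with one extreme value, for illustration only
prices = pd.Series([10.0, 15.0, 20.0, 25.0, 40.0, 60.0, 3300.0])

# log1p(x) = log(1 + x) compresses the long right tail
log_prices = np.log1p(prices)
print(log_prices.round(2).tolist())

# Share of wines above the 200 USD cut-off used in the plot
share_above = (prices > 200).mean() * 100
print(round(share_above, 2))
```

With the log scale every wine stays in the plot, at the cost of an axis that is harder to read in dollars.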

Let's investigate which countries have the most expensive and the highest rated wines. First we will sort the countries by mean price and then plot them.

In [None]:
# Select several columns with a list; tuple-style selection fails in recent pandas
z=data.groupby(['country'])[['price','points']].mean().reset_index().sort_values('price',ascending=False)
z[['country','price']].head(n=10)

In [None]:
plt.figure(figsize = (14,6))
plt.title('Wine prices in different countries')
sns.barplot(x = 'country', y="price", data=z.head(10))
locs, labels = plt.xticks()
plt.show()

In [None]:
z=z.sort_values('points', ascending=False)
z[['country','points']].head(10)

In [None]:
plt.figure(figsize = (14,6))
plt.title('Points for wines in different countries')
sns.set_color_codes("pastel")
sns.barplot(x = 'country', y="points", data=z.head(5))
locs, labels = plt.xticks()
plt.show()

We can easily see that the wines from Switzerland are the most expensive. I think the most impactful factor is the much higher price of all goods in this country. 
The highest mean number of points goes to England. 
Based on our data, let's try to make some guesses why English wines are the best.
* Most sommeliers come from England
* England provides information only for its best wines
* They are simply the best :)

We can partly check our second guess. Let's see how many wines in the dataset come from each country.

In [None]:
country=data['country'].value_counts()
country.head(10).plot.bar()
country.head(20)

We can see that England isn't even in the top 20, so our guess makes more sense. ;)

To solve our 'problem', it is also important to investigate the quality/price ratio.

In [None]:
z['quality/price']=z['points']/z['price']
z.sort_values('quality/price', ascending=False)[['country','quality/price']]

What can we see now? England was first in the points ranking, but in the points/price ranking it is second from the end. 
So, yeah, they provided information only for, let's say, 'premium' wines.  

We can also explore the data with box plots. Here is a nice visualisation of what a box plot can tell us. 

![logo](https://www.wellbeingatschool.org.nz/sites/default/files/W@S_boxplot-labels.png)
**Resource**: https://www.wellbeingatschool.org.nz/sites/default/files/W@S_boxplot-labels.png

In [None]:
df1= data[data.variety.isin(data.variety.value_counts().head(6).index)]

plt.figure(figsize = (14,6))
sns.boxplot(
    x = 'variety',
    y = 'points',
    data = df1
)

What can we read from this plot? For example, that the Red Blend has a low spread of points. The taller the box, the larger the spread (interquartile range) of the scores. If one box sits higher than another, that variety tends to get higher scores. 

**If you want more info, visit:** https://www.wellbeingatschool.org.nz/information-sheet/understanding-and-interpreting-box-plots
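
The numbers behind a box plot can also be computed directly. Here is a small sketch with made-up scores; it assumes the common Tukey convention of 1.5 times the interquartile range for the whiskers, which is also the seaborn default.

```python
import numpy as np

# Made-up scores for illustration
points = np.array([84, 86, 87, 88, 88, 89, 90, 91, 93, 99])

q1, median, q3 = np.percentile(points, [25, 50, 75])
iqr = q3 - q1  # height of the box

# Whiskers reach the most extreme values within 1.5 * IQR of the box;
# anything beyond these fences is drawn as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = points[(points < lower_fence) | (points > upper_fence)]

print(q1, median, q3, iqr)
print(outliers)
```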

# 3. Feature selection
At this stage we want to make our dataset smaller without losing model accuracy. 
So how can we do it? 
We can make a correlation plot and, for each pair of features whose correlation is close to 1 or -1, keep only one of them. 
What is correlation?

> Correlation is a statistical measurement of the relationship between two variables. Possible correlations range from +1 to –1. A zero correlation indicates that there is no relationship between the variables. A correlation of –1 indicates a perfect negative correlation, meaning that as one variable goes up, the other goes down. A correlation of +1 indicates a perfect positive correlation, meaning that both variables move in the same direction together.

Reference: https://www.verywellmind.com/what-is-correlation-2794986
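
As a rough sketch of correlation-based selection, the snippet below computes the pairwise correlations of a toy numeric frame and drops one feature from each highly correlated pair. The column names, the values and the 0.95 threshold are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up numeric features; 'price_eur' is almost a rescaled copy of 'price_usd'
toy = pd.DataFrame({
    "price_usd": [10.0, 20.0, 30.0, 40.0, 50.0],
    "price_eur": [9.0, 18.5, 27.0, 36.5, 45.0],
    "age_years": [3.0, 1.0, 4.0, 1.0, 5.0],
})

corr = toy.corr().abs()

# Look only at the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # assumed cut-off, tune for your data
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop)
```

Note that this only works for numeric columns. Most features in our dataset are categorical, which is one reason why a model-based importance ranking is more practical here.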

There are also some 'automatic' algorithms for feature selection:

1. Feature selection with correlation - find out which features are correlated and then drop all except one.

2. Univariate feature selection - Univariate feature selection works by selecting the best features based on univariate statistical tests.

3. Recursive feature elimination - Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features.

4. Tree based feature selection - Tree-based estimators can be used to compute feature importances, which in turn can be used to discard irrelevant features (when coupled with the sklearn.feature_selection.SelectFromModel meta-transformer).

**Reference**: http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

We will use tree-based feature selection, based on feature importances from CatBoostRegressor. 

# Feature importance with CatBoost.
First we will prepare our train, validation and test data. We will use the sklearn library. 

In [None]:
from sklearn.model_selection import train_test_split
from catboost import Pool, CatBoostRegressor, cv

X=data.drop(columns=['points'])

X=X.fillna(-1)
print(X.columns)
categorical_features_indices =[0,1, 3,4,5,6,7,8,9,10]
y=data['points']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=42)

X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, 
                                                    random_state=52)
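
The two consecutive 80/20 splits above do not give an 80/20 train/test ratio overall, because the validation set is carved out of the training part. The arithmetic can be checked quickly:

```python
# Fractions implied by two consecutive 80/20 splits
test_fraction = 0.2
valid_fraction = (1 - test_fraction) * 0.2
train_fraction = (1 - test_fraction) * 0.8

print(round(train_fraction, 2), round(valid_fraction, 2), round(test_fraction, 2))
```

So roughly 64% of the rows are used for training, 16% for validation and 20% for the final test.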

In [None]:
categorical_features_indices
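
The index list above is hard-coded, so it silently goes out of date when columns are added or removed. One possible alternative, sketched here on a hypothetical toy frame with made-up values, is to derive the positions from the column dtypes:

```python
import pandas as pd

# Toy frame with made-up values; only 'price' is numeric
toy_X = pd.DataFrame({
    "country": ["France", "Italy"],
    "designation": ["Reserve", "Classico"],
    "price": [15.0, 22.0],
    "variety": ["Pinot Noir", "Sangiovese"],
})

# Positions of all non-numeric columns
cat_indices = [
    i for i, col in enumerate(toy_X.columns)
    if not pd.api.types.is_numeric_dtype(toy_X[col])
]
print(cat_indices)
```

This assumes that every non-numeric column should be treated as categorical, which holds for the features used in this kernel.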

Create a CatBoostRegressor model with the root mean squared error (RMSE) loss function.

In [None]:
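def perform_model(X_train, y_train,X_valid, y_valid,X_test, y_test):
    model = CatBoostRegressor(
        random_seed = 400,
        loss_function = 'RMSE',
        iterations=400,
    )
    
    model.fit(
        X_train, y_train,
        cat_features = categorical_features_indices,
        eval_set=(X_valid, y_valid),
        verbose=False
    )
    
    # Compute RMSE explicitly from the predictions: depending on the CatBoost
    # version, model.score() may return R^2 rather than RMSE.
    rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print("RMSE on training data: " + str(rmse_train))
    print("RMSE on test data: " + str(rmse_test))
    
    return model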
    

Let's run our model and check score.

In [None]:
model=perform_model(X_train, y_train,X_valid, y_valid,X_test, y_test)

Now, we are ready to create feature importance plot. 

In [None]:
feature_score = pd.DataFrame(list(zip(X.dtypes.index, model.get_feature_importance(Pool(X, label=y, cat_features=categorical_features_indices)))),
                columns=['Feature','Score'])

feature_score = feature_score.sort_values(by='Score', ascending=False, inplace=False, kind='quicksort', na_position='last')


In [None]:
plt.rcParams["figure.figsize"] = (12,7)
ax = feature_score.plot('Feature', 'Score', kind='bar', color='c')
ax.set_title("Catboost Feature Importance Ranking", fontsize = 14)
ax.set_xlabel('')

rects = ax.patches

labels = feature_score['Score'].round(2)

for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 0.35, label, ha='center', va='bottom')

plt.show()

Let's try to drop the two feature columns which give the least information, title and region_1.

In [None]:
X=data.drop(columns=['points','title', 'region_1'])
X=X.fillna(-1)

print(X.columns)
categorical_features_indices =[0,1,3,4,5,6,7,8]
y=data['points']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=42)

X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, 
                                                    random_state=52)

And now perform the model once again. 

In [None]:
model=perform_model(X_train, y_train,X_valid, y_valid,X_test, y_test)

As we can see, our model performs only a little worse, but we save some computing time and RAM. Feature selection is much more useful with larger datasets, where a lot of columns are useless.

As we can see, the most important feature is price. The taster also has a big impact on the points score.

What is the next step? You can play with tuning the model. It would also be a good idea to test XGBoost or a neural network approach. If you want to maximize the score, you should also read about model stacking and genetic programming. If you want to know how to do NLP on the description, you should see my other kernel. 

# **If you are interested in NLP, please check my other kernel on the same data: https://www.kaggle.com/mistrzuniu1/catboost-points-predictions-with-simple-nlp/**