## Import necessary tools

In [1]:
# !pip install --upgrade lightgbm
# !pip install pycodestyle flake8 pycodestyle_magic
# !conda install py-xgboost
%load_ext pycodestyle_magic

In [35]:
import numpy as np
import pandas as pd
import gc
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
import warnings
import xgboost as xgb
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
import pickle

%matplotlib inline
warnings.filterwarnings("ignore")


The first thing I did was to read the data into a pandas dataframe

In [36]:
data = pd.read_csv('../data_root/raw/wine_dataset.csv')


In [37]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
Unnamed: 0               10000 non-null int64
country                  9994 non-null object
description              10000 non-null object
designation              7171 non-null object
points                   10000 non-null int64
price                    9323 non-null float64
province                 9994 non-null object
region_1                 8336 non-null object
region_2                 3853 non-null object
taster_name              8015 non-null object
taster_twitter_handle    7644 non-null object
title                    10000 non-null object
variety                  10000 non-null object
winery                   10000 non-null object
dtypes: float64(1), int64(2), object(11)
memory usage: 1.1+ MB


## My analysis will be done based on five aspects:

1. Exploratory Data Analysis (EDA)
2. Data Visualization
3. Feature Engineering
4. Model Building and Evaluation
5. Results and Conclusion

### Exploratory Data Analysis

Let's get a first look at our data

In [38]:
data.head()


Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Fragrances suggest hay, crushed tomato vine an...",Kirchleiten,90,30.0,Northeastern Italy,Alto Adige,,Kerin O’Keefe,@kerinokeefe,Tiefenbrunner 2012 Kirchleiten Sauvignon (Alto...,Sauvignon,Tiefenbrunner
1,1,France,"Packed with fruit and crisp acidity, this is a...",,87,22.0,Loire Valley,Sancerre,,Roger Voss,@vossroger,Bernard Reverdy et Fils 2014 Rosé (Sancerre),Rosé,Bernard Reverdy et Fils
2,2,Italy,"This easy, ruby-red wine displays fresh berry ...",,86,,Tuscany,Chianti Classico,,,,Dievole 2009 Chianti Classico,Sangiovese,Dievole
3,3,US,Pretty in violet and rose petals this is a low...,Horseshoe Bend Vineyard,92,50.0,California,Russian River Valley,Sonoma,Virginie Boone,@vboone,Davis Family 2012 Horseshoe Bend Vineyard Pino...,Pinot Noir,Davis Family
4,4,US,This golden wine confounds in a mix of wet sto...,Dutton Ranch,93,38.0,California,Russian River Valley,Sonoma,Virginie Boone,@vboone,Dutton-Goldfield 2013 Dutton Ranch Chardonnay ...,Chardonnay,Dutton-Goldfield


We can see from the sample of our data above. We note that our data is mostly text, contains quite a number of missing values, has only 2 numeric columns and we are to predict one of them (points)

There are usually two kinds of bad data: duplicate and misssing.
I decided to check for duplicate data and delete them if they exist.
My assumption: If the data contains the same name, title and description, then it is the same data

In [39]:
print("The total number of data samples: ", data.shape[0])
print("Duplicate data ", data[data.duplicated([
    'taster_name', 'title', 'description']
)].shape[0])


The total number of data samples:  10000
Duplicate data  58


we can see that there are 58 duplicate samples in our dataset so we have to drop them

In [40]:
data = data.drop_duplicates(['taster_name', 'title', 'description'])


Now we would need to take care of missing data, let's first check how many they are

In [41]:
data.isnull().sum()


Unnamed: 0                  0
country                     6
description                 0
designation              2813
points                      0
price                     676
province                    6
region_1                 1653
region_2                 6112
taster_name              1983
taster_twitter_handle    2353
title                       0
variety                     0
winery                      0
dtype: int64

That's quite a lot of missing data and we are going to have to take care of them

In [42]:
# if there are any infinity values in the data,
# replace with NaN
data = data.replace([np.inf, -np.inf], np.nan)


In [43]:
data.isnull().sum()


Unnamed: 0                  0
country                     6
description                 0
designation              2813
points                      0
price                     676
province                    6
region_1                 1653
region_2                 6112
taster_name              1983
taster_twitter_handle    2353
title                       0
variety                     0
winery                      0
dtype: int64

### Data Visualization

In [None]:
sns.countplot(data['points'])


We can see that all wine points are between 80 - 100 and most wine points are 88. Now lets see how price affects the points

In [12]:
plt.figure(figsize=(10, 4))

g = sns.regplot(x='points', y='price', data=data, fit_reg=True)
g.set_title("Points x Price Distribuition", fontsize=20)
g.set_xlabel("Points", fontsize=15)
g.set_ylabel("Price", fontsize=15)

plt.show()


10:1: W391 blank line at end of file


As we can see from the above, the higher the price, the higher the probability of getting a high point is, and this seems quite logical

Now let us see how country affects price.

In [None]:
data = data[['price', 'points', 'country', 'province']].copy()


In [None]:
data.head()


**We have to be careful while generating the features in order to avoid data leakage**

In [49]:
%%pycodestyle
def handle_missing_values(data):
    data = data.fillna(data.mean())  # fill missing values with the mean
    # the rows which have country and province empty
    # can be done away with since they are only 6
    data = data.dropna()
    return data


def data_trans(df, place, obj, stat):
    return df.groupby(place)[obj].transform(stat)


def data_diff(df, col1, col2):
    return df[col1] - df[col2]


def generate_price_features(df):
    df['price_per_country_mean'] = data_trans(
                                        df,
                                        'country',
                                        'price',
                                        'mean'
                                    )
    df['price_per_country_mean_diff'] = data_diff(
                                          df,
                                          'price',
                                          'price_per_country_mean'
                                        )
    df['price_per_country_median'] = data_trans(
                                        df,
                                        'country',
                                        'price',
                                        'median'
                                     )
    df['price_per_country_median_diff'] = data_diff(
                                            df,
                                            'price',
                                            'price_per_country_median'
                                          )
    df['price_per_province_mean'] = data_trans(
                                        df,
                                        'province',
                                        'price',
                                        'mean'
                                    )
    df['price_per_province_mean_diff'] = data_diff(
                                          df,
                                          'price',
                                          'price_per_province_mean'
                                        )
    df['price_per_province_median'] = data_trans(
                                        df,
                                        'province',
                                        'price',
                                        'median'
                                     )
    df['price_per_province_median_diff'] = data_diff(
                                            df,
                                            'price',
                                            'price_per_province_median'
                                          )
    return df


def generate_point_features(df):
    df['points_per_country_mean'] = data_trans(
                                        df,
                                        'country',
                                        'price',
                                        'mean'
                                    )
    df['points_per_country_mean_diff'] = data_diff(
                                          df,
                                          'price',
                                          'price_per_country_mean'
                                        )
    df['points_per_country_median'] = data_trans(
                                        df,
                                        'country',
                                        'price',
                                        'median'
                                     )
    df['points_per_country_median_diff'] = data_diff(
                                            df,
                                            'points',
                                            'points_per_country_median'
                                          )
    df['points_per_province_mean'] = data_trans(
                                        df,
                                        'province',
                                        'points',
                                        'mean'
                                    )
    df['points_per_province_mean_diff'] = data_diff(
                                          df,
                                          'points',
                                          'points_per_province_mean'
                                        )
    df['points_per_province_median'] = data_trans(
                                        df,
                                        'province',
                                        'points',
                                        'median'
                                     )
    df['points_per_province_median_diff'] = data_diff(
                                            df,
                                            'points',
                                            'points_per_province_median'
                                          )
    return df


112:1: W391 blank line at end of file


In [45]:
# %%pycodestyle
data = handle_missing_values(data)
data = generate_price_features(data)
data = generate_point_features(data)
data.isnull().sum()


Unnamed: 0                         0
country                            0
description                        0
designation                        0
points                             0
price                              0
province                           0
region_1                           0
region_2                           0
taster_name                        0
taster_twitter_handle              0
title                              0
variety                            0
winery                             0
price_per_country_mean             0
price_per_country_mean_diff        0
price_per_country_median           0
price_per_country_median_diff      0
price_per_province_mean            0
price_per_province_mean_diff       0
price_per_province_median          0
price_per_province_median_diff     0
points_per_country_mean            0
points_per_country_mean_diff       0
points_per_country_median          0
points_per_country_median_diff     0
points_per_province_mean           0
p

In [47]:
data.head()


Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,...,price_per_province_median,price_per_province_median_diff,points_per_country_mean,points_per_country_mean_diff,points_per_country_median,points_per_country_median_diff,points_per_province_mean,points_per_province_mean_diff,points_per_province_median,points_per_province_median_diff
3,3,US,Pretty in violet and rose petals this is a low...,Horseshoe Bend Vineyard,92,50.0,California,Russian River Valley,Sonoma,Virginie Boone,...,38.0,12.0,89.511891,2.488109,90,2,89.790674,2.209326,90,2
4,4,US,This golden wine confounds in a mix of wet sto...,Dutton Ranch,93,38.0,California,Russian River Valley,Sonoma,Virginie Boone,...,38.0,0.0,89.511891,3.488109,90,3,89.790674,3.209326,90,3
12,12,US,Brooding dark fruit and luxurious French oak s...,The Pixie,92,45.0,Washington,Red Mountain,Columbia Valley,Sean P. Sullivan,...,30.0,15.0,89.511891,2.488109,90,2,89.173423,2.826577,90,2
34,34,US,"From the wilds of Atlas Peak, this lovely wine...",Oso Malo,92,75.0,California,Napa Valley,Napa,Virginie Boone,...,38.0,37.0,89.511891,2.488109,90,2,89.790674,2.209326,90,2
48,48,US,"Clean and varietal, this firm and juicy Pinot ...",Signature Collection,88,27.0,Oregon,Oregon,Oregon Other,Paul Gregutt,...,36.0,-9.0,89.511891,-1.511891,90,-2,89.275194,-1.275194,90,-2


#### Building a model to select the most important features; It is not the main model for prediction

select the generated features

In [50]:
%%pycodestyle
train_features = [
    'price',
    'price_per_country_mean',
    'price_per_country_mean_diff',
    'price_per_country_median',
    'price_per_country_median_diff',
    'price_per_province_mean',
    'price_per_province_mean_diff',
    'price_per_province_median',
    'price_per_province_median_diff',
    'points_per_country_mean',
    'points_per_country_median',
    'points_per_province_mean',
    'points_per_province_median',
]

target_feature = 'points'


19:1: W391 blank line at end of file


In [51]:
df_train = data[train_features].copy().values
target = data[target_feature].copy().values


4:1: W391 blank line at end of file


### Split the dataset into train and test data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_train, target, test_size=0.3)


## Time to build the model

I first build an xgb model for the dataset and test it on validation part using KFold

In [53]:
kf = KFold(n_splits=10, shuffle=True)
for train_index, test_index in kf.split(X_train):
    X_, X_valid = X_train[train_index], X_train[test_index]
    y_, y_valid = y_train[train_index], y_train[test_index]
    sc = StandardScaler()
    X_ = sc.fit_transform(X_)
    X_valid = sc.transform(X_valid)
    std_mean = sc.mean_
    std_var = sc.var_
    xgb_model = xgb.XGBRegressor(
                    n_estimators=1000,
                    max_depth=20,
                    importance_type="gain",
                    learning_rate=0.01,
                    n_jobs=4
                )
    xgb_model.fit(X_, y_,
                  early_stopping_rounds=5,
                  eval_set=[(X_valid, y_valid)],
                  eval_metric="rmse",
                  verbose=True)


23:1: W391 blank line at end of file


Now we save the model to a pickle file

In [54]:
pkl_filename = "pickle_model.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(xgb_model, file)
print('saved')


6:1: W391 blank line at end of file


### Evaluating the model

we first read the model file then begin evaluation

In [55]:
with open("pickle_model.pkl", 'rb') as file:
    recovered_lgb_model = pickle.load(file)


4:1: W391 blank line at end of file


#### Predicting the values of the test set

In [None]:
X_test_std = (X_test - std_mean) / (std_var ** 0.5)
predictions = recovered_lgb_model.predict(X_test)


In [None]:
print(predictions)


In [None]:
error_ = mse(predictions, y_test)
print(error_)


### our mean square error gives us an error of 11.16

We check for feature importance to know which parameters were very important in creating the model

In [None]:
feature_importance = xgb_model.feature_importances_


In [None]:
plt.figure(figsize=(12, 6))
sns.barplot(x=feature_importance, y=train_features)


#### We can see that important features that are used to generate the predictions so we select them

In [None]:
important_features = [
    'price',
    'price_per_country_mean',
    'price_per_country_mean_diff',
    'price_per_country_median',
    'price_per_country_median_diff',
    'price_per_province_mean',
    'price_per_province_mean_diff',
    'points_per_country_mean',
    'points_per_country_median',
    'points_per_province_mean'
]


**we would save these features to ensure persistence
so for every new data we get, we would use these features to generate the features**