## *Applied Machine Learning*

# *Assignment 3*

***Teammates:***

Karthik Rajaraman Iyer
kr2859@columbia.edu

Anjani Prasad Atluri
aa4462@columbia.edu

In [0]:
#Installs
!pip install xgboost
!pip install --upgrade category_encoders

In [0]:
#imports
from google.colab import drive
import pandas as pd
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import make_column_transformer 
from category_encoders import TargetEncoder
from sklearn.pipeline import make_pipeline, Pipeline
from nltk.stem.snowball import EnglishStemmer
from sklearn.model_selection import cross_val_score 
from sklearn.preprocessing import StandardScaler 
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import download
from sklearn.linear_model import RidgeCV
from category_encoders import CatBoostEncoder
from sklearn.feature_extraction.text import CountVectorizer
from nltk import word_tokenize     
import numpy as np
import nltk     
from nltk.stem import WordNetLemmatizer 
from sklearn.model_selection import GridSearchCV 
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import accuracy_score, r2_score
from sklearn.impute import SimpleImputer
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor,GradientBoostingRegressor,RandomForestRegressor

In [2]:
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
#mounting drive
drive.mount("/content/gdrive")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
#data loading
df = pd.read_csv('/content/gdrive/My Drive/Columbia Photos/winemag-data-130k-v2.csv')

# Pre-processing

We remove the designation, country, region_2, taster_twitter_handle, and 'Unnamed: 0' columns. The designation column is categorical with too many unique values, and it could leak information. We subset on the country column (keeping only US wines) and drop it afterwards. The region_2 column has too many null values, and region_1 already provides the necessary region information; keeping both could also introduce collinearity. We drop the taster_twitter_handle column because it is an identifier column that could leak target information. 'Unnamed: 0' is the index column.
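The kind of check behind these decisions can be sketched as follows. This is a minimal illustration on a made-up toy table (the values and the `toy` and `summary` names are assumptions, not the real dataset): it reports the fraction of missing values and the number of distinct values per column.

```python
import pandas as pd

# Toy stand-in for the wine reviews table (values are made up for illustration).
toy = pd.DataFrame({
    "designation": ["Reserve", "Estate", "Old Vine", None],
    "region_1": ["Napa Valley", "Napa Valley", "Lodi", "Willamette Valley"],
    "region_2": [None, None, "Central Valley", None],
    "points": [90, 88, 87, 91],
})

# Per-column diagnostics: fraction of missing values and number of distinct values.
# nunique() ignores missing values by default.
summary = pd.DataFrame({
    "missing_frac": toy.isnull().mean(),
    "n_unique": toy.nunique(),
})
print(summary)
```

Columns with a high `missing_frac` or with `n_unique` close to the number of rows are natural candidates for dropping or for a more careful encoding.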

In [71]:
#subsample to wines from the U.S.
df = df[df['country']=='US']

#Removing columns (designation, country, region_2, taster_twitter_handle,Unnamed: 0)
columns = ['designation', 'country', 'region_2', 'taster_twitter_handle','Unnamed: 0']
df.drop(columns, inplace=True, axis=1)

print("Column name and the amount of missing data in that column")

#Seeing if there are NAs in rest of the columns
print(df.isnull().sum(axis = 0))

#removing outliers from the price column
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
df = df[~((df['price'] < (Q1 - 1.5 * IQR)) |(df['price'] > (Q3 + 1.5 * IQR)))]

#Subset the data
df1=df.sample(frac=0.5, random_state=0)
#df1 = df.copy()

#Splitting the dataset into target variables and the covariates
y=pd.DataFrame(df1['points']).values
X= df1.loc[:, df1.columns != 'points']

#Splitting the dataset into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)


Column name and the amount of missing data in that column
description        0
points             0
price            239
province           0
region_1         278
taster_name    16774
title              0
variety            0
winery             0
dtype: int64


We have removed the outliers from the price column (values more than 1.5 IQR below Q1 or above Q3) to make better predictions.
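The IQR rule used above can be sketched as a small reusable function. The `iqr_filter` helper and the toy prices are illustrative assumptions. Note that rows with a missing price are kept, because comparisons against NaN evaluate to False; this matches the behaviour of the filter in the cell above, where missing prices are imputed later.

```python
import pandas as pd

def iqr_filter(frame, column, k=1.5):
    """Drop rows whose value lies more than k * IQR outside [Q1, Q3]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # NaN compares as False on both sides, so rows with a missing value are kept.
    is_outlier = (frame[column] < lower) | (frame[column] > upper)
    return frame[~is_outlier]

# Made-up prices with one obvious outlier and one missing value.
toy_prices = pd.DataFrame({"price": [10.0, 12.0, 15.0, 18.0, 20.0, 500.0, None]})
filtered = iqr_filter(toy_prices, "price")
print(filtered)
```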

## **Task 1.1**

In [6]:
X_train1 = X_train.drop(['description','title'],inplace=False, axis=1)
X_test1 = X_test.drop(['description','title'],inplace=False, axis=1)

#Getting the categorical and the continuous columns of the dataframe
cat_cols= list(X_train1.select_dtypes(object).columns)
cont_cols= list(X_train1.columns[X_train1.dtypes !=object])
cat_cols.remove('taster_name')
cat_cols_on = ['taster_name']

#resetting the row index of the training data
X_train1.reset_index(drop=True,inplace=True)

cont_preprocessing = make_pipeline(SimpleImputer(strategy='median'),StandardScaler())

cat_preprocessing = make_pipeline(SimpleImputer(strategy='constant',fill_value='np'),CatBoostEncoder(handle_missing='value',return_df=False),StandardScaler())

cat_preprocessing_on = make_pipeline(SimpleImputer(strategy='constant',fill_value='NA'),OneHotEncoder(handle_unknown='ignore', sparse=False))

preprocess = make_column_transformer((cat_preprocessing, cat_cols), (cont_preprocessing,cont_cols),(cat_preprocessing_on,cat_cols_on))

#Note: min_samples_split is a scikit-learn tree parameter, not an XGBoost one, so it is omitted here
pipe = Pipeline([("pre",preprocess),("regressor",xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.7, learning_rate = 0.1,max_depth = 15, alpha = 0.1, n_estimators = 300))])

pipe_grid={}
grid = GridSearchCV(pipe,param_grid=pipe_grid,cv=5,scoring="r2", n_jobs=-1)
grid.fit(X_train1,y_train)

print("Baseline's score on the test set", grid.best_estimator_.score(X_test1, y_test))

Baseline's score on the test set 0.4192172244940921


We chose XGBoost Regressor as our baseline model. We have encoded the taster's name using one hot encoding and the rest of the categorical columns using catboost encoding. We have imputed all the missing data from all the columns and scaled the continuous column 'price'.
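To make the idea behind CatBoost-style encoding concrete, here is a simplified sketch of an ordered target statistic: each row is encoded using only the targets of earlier rows in the same category, blended with the global mean. The `ordered_target_encode` function, its `prior_weight` parameter, and the toy data are assumptions for illustration; this is not the exact implementation in `category_encoders`.

```python
import pandas as pd

def ordered_target_encode(categories, target, prior_weight=1.0):
    """Encode each row from the targets of earlier rows in the same category."""
    frame = pd.DataFrame({"cat": list(categories), "y": list(target)})
    prior = frame["y"].mean()  # global mean, used as a smoothing prior
    grouped = frame.groupby("cat")["y"]
    prev_sum = grouped.cumsum() - frame["y"]  # sum of earlier targets in the category
    prev_count = grouped.cumcount()           # number of earlier rows in the category
    return (prev_sum + prior * prior_weight) / (prev_count + prior_weight)

# Made-up wineries and scores.
wineries = ["A", "B", "A", "A", "B"]
points = [90, 85, 92, 88, 87]
encoded = ordered_target_encode(wineries, points)
print(encoded.tolist())
```

Because a row never sees its own target, this kind of encoding reduces target leakage for high-cardinality columns such as winery, which is why it is a reasonable alternative to one-hot encoding there.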

## **Task 1.2** 

### Simple Bag of Words Model

In [0]:
#Including the Lemmatizer from nltk by defining LemmaTokenizer class reference: https://scikit-learn.org/stable/modules/feature_extraction.html

class LemmaTokenizer:
  def __init__(self):
    self.wnl = WordNetLemmatizer()
  def __call__(self, doc):
    return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

#Making a Stemming function from nltk reference: https://stackoverflow.com/questions/36182502/add-stemming-support-to-countvectorizer-sklearn
stemmer = EnglishStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))


In [8]:
#Simple text-based model using bag of words and a linear model

#Append the title column data to the description column data and making a new dataframe
nX_train= list(X_train['description'] +' '+ X_train['title'])
nX_test = list(X_test['description'] +' '+ X_test['title'])

vect = CountVectorizer(stop_words="english")
teXt_train = vect.fit_transform(nX_train)
teXt_test = vect.transform(nX_test)

clf = Ridge(alpha=1).fit(teXt_train, y_train)
print("testing score for plain bag of words model: ",clf.score(teXt_test,y_test))

testing score for plain bag of words model:  0.6694834440072108


## **Task 1.3**

In [14]:
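#n-grams (1 to 4 grams) with stemmer
#Note: because analyzer is a callable, scikit-learn does not apply ngram_range, stop_words, or lowercase here;
#stemmed_words (built from the default analyzer) yields lowercased, stemmed unigrams, and only min_df takes effect

gram14 = CountVectorizer(ngram_range=(1, 4), min_df=2, stop_words="english",lowercase=True, analyzer=stemmed_words)
X_train_grm14 = gram14.fit_transform(nX_train)
lr_grm14 = Ridge(alpha=22).fit(X_train_grm14, y_train)
X_test_grm14 = gram14.transform(nX_test)
print("Testing score (R^2) on 1-4 grams: ",lr_grm14.score(X_test_grm14, y_test))

Testing score (R^2) on 1-4 grams:  0.7344068405323045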


We found that the n-gram model gave the best results with an n-gram range of 1-4. We also requested stop-word removal and applied word stemming, following https://stackoverflow.com/questions/36182502/add-stemming-support-to-countvectorizer-sklearn. Note that scikit-learn ignores `ngram_range` and `stop_words` when `analyzer` is a callable, so the features fitted by this cell are stemmed unigrams and the reported score should be read with that in mind.

We restricted the vocabulary to terms that appeared in at least 2 documents (`min_df=2`).
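One way to combine stemming with word n-grams and stop-word removal is to stem inside the tokenizer rather than pass a callable analyzer. The sketch below contrasts the two approaches on made-up sentences. The `toy_stem` function is a hypothetical stand-in for nltk's `EnglishStemmer` (so the example needs only scikit-learn), and `StemmedCountVectorizer` is an illustrative name, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["ripe cherries and firm tannins", "firm tannins with ripe plums"]
stop_list = ["and", "with"]  # small explicit stop list for the illustration

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a trailing plural "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

base_analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    return [toy_stem(w) for w in base_analyzer(doc)]

# Callable analyzer: ngram_range and stop_words are accepted but not applied.
callable_vect = CountVectorizer(analyzer=stemmed_words, ngram_range=(1, 2),
                                stop_words=stop_list).fit(docs)

class StemmedCountVectorizer(CountVectorizer):
    # Stem inside the tokenizer so stop-word removal and n-grams still run.
    def build_tokenizer(self):
        tokenize = super().build_tokenizer()
        return lambda doc: [toy_stem(t) for t in tokenize(doc)]

ngram_vect = StemmedCountVectorizer(ngram_range=(1, 2), stop_words=stop_list).fit(docs)

callable_vocab = sorted(callable_vect.vocabulary_)
ngram_vocab = sorted(ngram_vect.vocabulary_)
print(callable_vocab)
print(ngram_vocab)
```

The first vocabulary contains only single stemmed words, including the stop words; the second also contains stemmed bigrams and omits the stop words.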

In [18]:
#character n-grams with tuning and with high regularization

#high regularization because the number of features increases and we should not give high importance (large coefficients) to each of them
#It performs well with high regularization
#Note: with analyzer="char_wb", scikit-learn does not use the tokenizer and stop_words arguments

cgram = CountVectorizer(tokenizer=LemmaTokenizer(),ngram_range=(1, 3), min_df=3, stop_words="english",lowercase=True,analyzer="char_wb")
X_train_cgrm = cgram.fit_transform(nX_train)
lr_cgrm = Ridge(alpha=90).fit(X_train_cgrm, y_train)
X_test_cgrm = cgram.transform(nX_test)
print("Testing score (R^2) on character grams with word boundaries: ",lr_cgrm.score(X_test_cgrm, y_test))

Testing score (R^2) on character grams with word boundaries:  0.7093071752230775


The character n-gram model worked best with 1-3 characters, but it did not perform as well as the other models. It needed strong regularization (alpha of 90) because there are many more features, and an under-regularized model would not generalize well. The n-grams are built within word boundaries (`char_wb`). The `stop_words` and `tokenizer` arguments have no effect with a character analyzer, so stop words are not actually removed here. We kept only character n-grams that appeared in at least 3 documents.
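To see what "respecting word boundaries" means, the following sketch compares scikit-learn's `char_wb` and `char` analyzers on a made-up phrase. With `char_wb`, each word is padded with a space on both sides and n-grams never span two words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Build the two character analyzers without fitting a vocabulary.
char_wb = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).build_analyzer()
char = CountVectorizer(analyzer="char", ngram_range=(3, 3)).build_analyzer()

wb_grams = char_wb("oak spice")   # trigrams inside space-padded words
plain_grams = char("oak spice")   # trigrams over the whole string

print(wb_grams)
print(plain_grams)
```

The plain `char` analyzer produces the trigram `"k s"`, which straddles the two words; `char_wb` does not.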

In [27]:
#tf-idf rescaling

tiv = TfidfVectorizer(stop_words="english", tokenizer=LemmaTokenizer())
tfidf_X = tiv.fit_transform(nX_train)
lr_tfidf = Ridge(alpha=0.5).fit(tfidf_X, y_train)
tfidf_X_test = tiv.transform(nX_test)
print("Testing score (R^2) on data after tf-idf rescaling: ",lr_tfidf.score(tfidf_X_test, y_test))


Testing score (R^2) on data after tf-idf rescaling:  0.725454744901997


The tf-idf rescaled model performed better with lemmatization. We referred to the following page for adding lemmatization: https://scikit-learn.org/stable/modules/feature_extraction.html.
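As a worked illustration of the rescaling, the sketch below reproduces scikit-learn's default tf-idf weighting by hand on a made-up corpus: raw counts are multiplied by a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, and each row is then L2-normalized. The corpus and variable names are assumptions for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["cherry oak cherry", "oak tannin", "cherry plum"]

# Raw term counts (tf), one row per document.
tf = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = tf.shape[0]
doc_freq = (tf > 0).sum(axis=0)  # number of documents containing each term

# Smoothed idf, then L2 row normalization (scikit-learn's defaults).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
weighted = tf * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

Terms that occur in many documents get a smaller idf, so common words contribute less to each document vector than rare, more informative ones.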
 

In [10]:
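#tuned bag of words model with stop words, different tokenization, removed non-frequent words, stemming
#Note: with a callable analyzer, stop_words and token_pattern are not applied; min_df and stemming are

vect = CountVectorizer(min_df=3, stop_words="english",lowercase=True, analyzer=stemmed_words, token_pattern=r"\b\w[\w’]+\b")
#toarray() returns a plain ndarray; np.matrix (from todense()) is rejected by recent scikit-learn versions
teXt_train = vect.fit_transform(nX_train).toarray()
teXt_test = vect.transform(nX_test).toarray()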

clf = Ridge(alpha=11).fit(teXt_train, y_train)
print("testing score for tuned bag of words model: ",clf.score(teXt_test,y_test))

testing score for tuned bag of words model:  0.732732383755386


The bag of words model performed best at an alpha of 11 with word stemming. We kept only words that appeared in at least 3 documents (`min_df=3`). Stop-word removal was requested, but it does not take effect when a callable analyzer is used.
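A regularization strength such as this alpha can be chosen by cross-validation on the training data rather than by comparing test scores. The sketch below shows one way to do that with a `CountVectorizer` and `Ridge` pipeline; the reviews, scores, and alpha grid are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline

# Made-up reviews and scores, just large enough for 3-fold cross-validation.
docs = [
    "ripe cherry and polished tannins", "rich fruit with a long finish",
    "elegant and balanced with ripe plum", "polished rich and elegant",
    "long finish and ripe fruit", "balanced tannins and rich cherry",
    "thin and harsh with bitter oak", "short finish and harsh acidity",
    "bitter and thin with a short finish", "harsh tannins and bitter fruit",
    "thin fruit and short finish", "bitter oak and harsh finish",
]
scores = [93, 92, 94, 93, 92, 91, 84, 83, 82, 85, 84, 83]

alphas = [0.1, 1.0, 10.0]
pipeline = make_pipeline(CountVectorizer(), Ridge())

# make_pipeline names the steps after their classes, hence "ridge__alpha".
search = GridSearchCV(
    pipeline,
    param_grid={"ridge__alpha": alphas},
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    scoring="r2",
)
search.fit(docs, scores)
print(search.best_params_)
```

Keeping the vectorizer inside the pipeline means the vocabulary is refit on each training fold, so the held-out fold does not influence feature selection.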

## **Task 1.4**

In [66]:
#adding the tuned bag of words to the other data from task 1.1

#making column names for the bag of words data
c = ['p'+str(i) for i in range(teXt_train.shape[1])]

X_test1.reset_index(drop=True,inplace=True)

Xtr=pd.DataFrame(teXt_train, columns = c)
Xte=pd.DataFrame(teXt_test, columns = c)

Xtr.reset_index(drop=True,inplace=True)
Xte.reset_index(drop=True,inplace=True)

#Merging the bag of words data and the rest of the data
X_train2 = pd.concat([Xtr, X_train1], axis=1)
X_test2 = pd.concat([Xte, X_test1], axis=1)

X_train2.reset_index(drop=True,inplace=True)
X_test2.reset_index(drop=True,inplace=True)

preprocess = make_column_transformer((cat_preprocessing, cat_cols), (cont_preprocessing,cont_cols), ( 'passthrough',c) , (cat_preprocessing_on,cat_cols_on))

pipe = Pipeline([("pre",preprocess),("regressor",Ridge())])

pipe_grid={"regressor__alpha": [15, 20]}
grid = GridSearchCV(pipe,param_grid=pipe_grid,cv=2,scoring="r2")
grid.fit(X_train2,y_train)

print("Best alpha value is:", grid.best_params_)
print("model with bag of words and rest of the data's score on the test set", grid.best_estimator_.score(X_test2, y_test))

Best alpha value is: {'regressor__alpha': 20}
model with bag of words and rest of the data's score on the test set 0.7655129909622163


The n-gram model performed the best of all our models in Task 1.3. Combining the bag-of-words features with the rest of the data performed better than the n-gram model alone (test R^2 of about 0.766 versus 0.734), so adding the non-text features improved performance.
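An alternative way to combine text and tabular features, without converting the bag-of-words matrix to a dense DataFrame, is to vectorize the text column inside the column transformer. The sketch below is a minimal illustration on made-up data; the column names mirror the dataset, but the values, the encoders chosen, and the variable names are assumptions.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows that mimic the text, categorical, and numeric columns.
toy = pd.DataFrame({
    "description": ["ripe cherry and oak", "bright acidity and citrus",
                    "rich plum with firm tannins", "thin and short finish",
                    "balanced fruit and spice", "harsh and bitter oak"],
    "province": ["California", "Oregon", "California",
                 "Washington", "Oregon", "California"],
    "price": [45.0, 30.0, 60.0, 18.0, 25.0, 15.0],
})
toy_points = [92, 90, 94, 85, 88, 84]

features = make_column_transformer(
    (CountVectorizer(), "description"),  # a single column name passes 1-D text
    (OneHotEncoder(handle_unknown="ignore"), ["province"]),
    (StandardScaler(), ["price"]),
    sparse_threshold=1.0,  # keep the stacked output sparse
)
model = make_pipeline(features, Ridge(alpha=1.0))
model.fit(toy, toy_points)

combined = model.named_steps["columntransformer"].transform(toy)
predictions = model.predict(toy)
print(combined.shape, sparse.issparse(combined))
```

Because the vectorizer lives inside the pipeline, cross-validation refits the vocabulary on each fold and the feature matrix stays sparse, which matters when the vocabulary has thousands of terms.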