# Lab Assignment One: Exploring Table Data
Andrew Sneed

Tristan Knotts

Fernando Corral

Machine Learning - CSE 5324


## Imports

In [4]:
import numpy as np
import pandas as pd
import pandas_datareader as pdr
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Business Understanding



<p> This dataset, "winemag Data," consists of information from nearly 130,000 winemag.com wine reviews. In addition to each reviewer's description and the points awarded (out of 100), the dataset contains factual information on each wine: location, designation, price, variety, title, and winery. The dataset was originally scraped by Kaggle user Zack Thoutt (username: Zackthoutt), whose stated goal was to "create a predictive model to identify wines through blind tasting like a master sommelier". 
    
Our prediction task is to estimate the points a reviewer would award a particular wine, given only its factual characteristics. Readers of product reviews fall into two groups: some read the full review, and some focus solely on the rating. Because many readers care only, or mostly, about the rating, winemag.com has an incentive to ensure that reviewers exercise independent judgment when rating a wine. For instance, a strong correlation between points awarded and wine price would make the ratings far less useful to readers. An algorithm that accurately predicts the score a given reviewer will award each wine could serve as an accountability check: if scores are predictable from facts such as price alone, the reviewer is adding little original judgment. Other publications that emphasize ratings, such as Pitchfork, may also be interested in this algorithm to hold their reviewers accountable.

A more ethically troubling endgame for this prediction task would be to replace reviewers with algorithms. </p>
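
As a rough illustration of how this prediction task could be framed, the sketch below compares two naive baselines on a handful of invented rows: always predicting the overall mean score, and predicting the mean score of the wine's variety. The rows, the `toy` frame, and the `mae` helper are assumptions made for this example and are not taken from the dataset. The error is measured on the same rows used to compute the means, so it is optimistic.

```python
import pandas as pd

# Invented rows using the dataset's column names; these are not real reviews.
toy = pd.DataFrame({
    "variety": ["Riesling", "Riesling", "Pinot Noir", "Pinot Noir", "Pinot Noir"],
    "points":  [87, 88, 92, 90, 91],
})

# Baseline 1: always predict the overall mean score.
global_pred = pd.Series(toy["points"].mean(), index=toy.index)

# Baseline 2: predict the mean score of the wine's variety.
variety_pred = toy.groupby("variety")["points"].transform("mean")

def mae(pred):
    """Mean absolute error of a prediction against the true points."""
    return (toy["points"] - pred).abs().mean()

print("global mean MAE:", mae(global_pred))
print("per-variety MAE:", mae(variety_pred))
```

Any real model for the task would need to beat baselines of this kind on held-out reviews before it could say anything about reviewer predictability.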

===================================================================================

Dataset: Wine Reviews URL: https://www.kaggle.com/zynicide/wine-reviews

## 2. Data Understanding

### 2.1 Data Description

In [5]:
df = pd.read_csv('./wine-reviews/winemag-data-130k-v2.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [7]:
print("Number of provinces:",len(df.province.unique()))
print("Number of countries:",len(df.country.unique()))
print("Number of primary regions:",len(df.region_1.unique()))
print("Number of secondary regions:",len(df.region_2.unique()))
print("Number of wine varieties:",len(df.variety.unique()))
print("Number of wineries:",len(df.winery.unique()))
print("Number of designations:",len(df.designation.unique()))

print("Number of reviewers:",len(df.taster_name.unique()))

Number of provinces: 426
Number of countries: 44
Number of primary regions: 1230
Number of secondary regions: 18
Number of wine varieties: 708
Number of wineries: 16757
Number of designations: 37980
Number of reviewers: 20
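
One caveat about the counts above: `Series.unique()` returns a missing value as its own entry, so `len(df.col.unique())` is one higher than the number of actual categories whenever a column contains missing values (as `region_2` and `designation` do in the preview). A minimal sketch on a made-up Series, not on the dataset itself:

```python
import numpy as np
import pandas as pd

# Toy column standing in for a field such as region_2 (values are made up).
toy = pd.Series(["Napa", "Sonoma", np.nan, "Napa", np.nan])

with_nan = len(toy.unique())               # NaN is counted as its own value
without_nan = toy.nunique()                # NaN is excluded by default
also_with_nan = toy.nunique(dropna=False)  # opt back in to counting NaN

print(with_nan, without_nan, also_with_nan)
```

Using `nunique()` would report the number of distinct non-missing categories directly.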


In [20]:
data_des = pd.DataFrame()

data_des['Features'] = df.columns
data_des['Scales'] = ["-Column needs to be removed-","Nominal","Nominal","Nominal", "Ordinal", "Ratio" ,"Nominal" ,"Nominal" ,"Nominal" ,"Nominal" ,"Nominal" ,"Nominal" ,"Nominal" ,"Nominal"]

data_des

Unnamed: 0,Features,Scales
0,Unnamed: 0,-Column needs to be removed-
1,country,Nominal
2,description,Nominal
3,designation,Nominal
4,points,Ordinal
5,price,Ratio
6,province,Nominal
7,region_1,Nominal
8,region_2,Nominal
9,taster_name,Nominal


### 2.2 Data Quality

## 3. Data Visualization

In [None]:
import warnings
warnings.simplefilter('ignore', DeprecationWarning)

import missingno as mn

# Sort on the columns with missing values so that the gaps group together in the matrix plot
mn.matrix(df.sort_values(by=['price','designation','region_1','region_2','taster_twitter_handle']))
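
The matrix plot shows where values are missing; a numeric summary per column can complement it. The following is a minimal sketch on an invented frame that reuses three of the dataset's column names. The rows and the `summary` frame are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame with a few of the dataset's column names; the rows are invented.
toy = pd.DataFrame({
    "price": [15.0, np.nan, 14.0, 65.0],
    "designation": ["Avidagos", None, None, "Reserve"],
    "points": [87, 87, 86, 90],
})

missing_count = toy.isna().sum()    # number of missing entries per column
missing_share = toy.isna().mean()   # fraction of rows missing per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("missing", ascending=False))
```

Applied to `df`, the same two calls would indicate which columns lose the most rows under `dropna()`.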

In [None]:
for col in ['description','region_2','taster_twitter_handle']:
    if col in df:
        del df[col]

In [None]:
# Keep only the rows with no missing values
nona = df.dropna()
# Distribution of wine prices among the complete rows
x = nona['price']
sns.histplot(x, kde=True);

### Analysis of price vs rating comparison



In [None]:
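# Sort by price so the most expensive wines come first
nona_sort = nona.sort_values(by=['price'],ascending=False)
# Inspect the top rows (only displayed when this is the last expression in a cell)
nona_sort.head()
# Drop one extreme price outlier by its row label so it does not compress the x-axis
nona_nooutlier = nona_sort.drop(index=120391)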

plt.figure(figsize=(10,6.2))
plt.scatter(nona_nooutlier['price'],nona_nooutlier['points'])
plt.title('Loose Trend Between Points Awarded and Price')
plt.xlabel('Price in USD')
plt.ylabel('Wine Review Score')
plt.show()

In [None]:
# Correlation coefficient between points and price
nona['points'].corr(nona['price'])
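
`Series.corr` computes the Pearson coefficient by default, which measures linear association only. The sketch below, on invented (price, points) pairs, reproduces Pearson's r from its definition and contrasts it with a rank-based (Spearman) coefficient, obtained here by applying Pearson to the ranks. The numbers are made up for illustration and say nothing about the actual dataset.

```python
import numpy as np
import pandas as pd

# Invented (price, points) pairs, used only to illustrate the calculation.
toy = pd.DataFrame({
    "price":  [10.0, 15.0, 20.0, 40.0, 300.0],
    "points": [84, 86, 87, 90, 95],
})

# Pearson r from its definition: co-deviation over the product of deviations.
dx = toy["price"] - toy["price"].mean()
dy = toy["points"] - toy["points"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = toy["points"].corr(toy["price"])   # Pearson by default

# Spearman: Pearson applied to ranks, so only the ordering matters.
rho = toy["points"].rank().corr(toy["price"].rank())

print(round(r_manual, 3), round(r_pandas, 3), round(rho, 3))
```

In this toy case the relationship is perfectly monotonic but not linear, so the rank-based value is higher than Pearson's r. A similar gap on the real data would suggest that price relates to points in a non-linear way.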

In [None]:
# Summary statistics for every numeric column, grouped by reviewer
df_grouped = nona.groupby(by=['taster_name'])
df_tasters = df_grouped.describe()
# Keep only the statistics for the points each reviewer awarded
df_tasters_points = df_tasters['points']
df_tasters_points
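
If only a few statistics per reviewer are of interest, a narrower aggregation may be easier to read than the full `describe()` output. This is a minimal sketch on invented scores for two hypothetical reviewers ("Taster A" and "Taster B"). The `mean_offset` column is an illustrative addition and not part of the original analysis.

```python
import pandas as pd

# Invented scores for two hypothetical reviewers.
toy = pd.DataFrame({
    "taster_name": ["Taster A", "Taster A", "Taster A", "Taster B", "Taster B"],
    "points": [86, 88, 90, 91, 93],
})

# Count, mean, and sample standard deviation of points per reviewer.
per_taster = toy.groupby("taster_name")["points"].agg(["count", "mean", "std"])

# How far each reviewer's average sits from the overall average score.
per_taster["mean_offset"] = per_taster["mean"] - toy["points"].mean()
print(per_taster)
```

A large offset for a reviewer would hint that the reviewer's identity alone carries information about the score, which bears on the prediction task.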

In [None]:
# sns boxplot
dims = (30, 10)

fig, ax = plt.subplots(figsize=dims)
sns.boxplot(ax=ax, x='taster_name', y='points', data=nona)
plt.show()


## Reference