# Exploratory Data Analysis on Wine Reviews

---


⌨ Anna Christensen.

This analysis explores wine reviews from around the world. Specifically, the goal is to determine which country produces the best wine, as measured by review points.

# About the data
The dataset used in this project comprises wine reviews scraped from the Wine Enthusiast website (a magazine and ecommerce business) by Kaggle user zackthoutt during the week of June 15th, 2017. It includes various attributes such as the wine's country of origin, variety, points awarded, price, and detailed descriptions, among others.

In [None]:
import pandas as pd
df=pd.read_csv('/content/wine.csv')
df.head()

Unnamed: 0,country,description,designation,points,price,province,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


# Analysis
The analysis below explores which characteristics make a wine great and which ones do not. Specifically, it will answer the following questions:

1. Which country produces wine with the most points, on average? (descriptive statistics)
2. Which taster gives the lowest scores (points), on average?
3. Which variety of wine is the most expensive, on average?
4. Which year of wines has the best score (points), on average?
5. Do reviews with the word "depth" in them tend to get better than average or worse than average points?
6. Do reviews with the word "fruity" in them tend to get better than average or worse than average points?
7. Do reviews with the word "herbal" in them tend to get better than average or worse than average points?
8. Do longer reviews (more characters) award more or fewer points, on average?
9. Which region of the province Sicily & Sardinia produces the best wine, on average?

1. Which country produces wine with the most points, on average? (descriptive statistics)

In [None]:
#Which country produces wine with the most points, on average?
avg_points_by_country = df.groupby("country")["points"].mean().sort_values(ascending=False)
top_country = avg_points_by_country.idxmax()
top_avg_points = avg_points_by_country.max()
top_country, round(top_avg_points,2)


('England', 91.58)
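
An average on its own does not show how many reviews stand behind it, and a group with only a handful of reviews can top the ranking. The sketch below is one way to report the mean together with the review count, and optionally to keep only groups with a minimum number of reviews. The helper name `mean_with_count`, the `min_reviews` parameter, and the toy data are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

def mean_with_count(df, group_col, value_col, min_reviews=1):
    """Return the mean and count of value_col per group, best mean first.

    Groups with fewer than min_reviews non-null values are dropped.
    """
    summary = (
        df.groupby(group_col)[value_col]
        .agg(["mean", "count"])
        .sort_values("mean", ascending=False)
    )
    return summary[summary["count"] >= min_reviews]

# Toy data, invented for illustration only (not taken from the wine dataset)
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B"],
    "points": [88, 90, 92, 95],
})

# Without a threshold, the single-review group "B" ranks first
print(mean_with_count(toy, "country", "points"))

# Requiring at least 3 reviews leaves only group "A"
print(mean_with_count(toy, "country", "points", min_reviews=3))
```

The same helper could be pointed at `variety`/`price`, `year`/`points`, or `region`/`points` to check how many reviews support each of the later answers.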

2. Which taster gives the lowest scores (points), on average?

In [None]:

avg_points_by_taster = df.groupby("taster_name")["points"].mean().dropna().sort_values()
lowest_scoring_taster = avg_points_by_taster.idxmin()
lowest_avg_points = round(avg_points_by_taster.min(), 2)
lowest_scoring_taster, lowest_avg_points


('Alexander Peartree', 85.86)

3. Which variety of wine is the most expensive, on average?

In [None]:
avg_price_by_variety = df.groupby("variety")["price"].mean().sort_values(ascending=False)
most_expensive_variety = avg_price_by_variety.idxmax()
highest_avg_price = avg_price_by_variety.max()
most_expensive_variety, highest_avg_price


('Ramisco', 495.0)

In [None]:
ramisco_wines = df[df["variety"].str.contains("Ramisco", case=False, na=False)]
ramisco_wines

Unnamed: 0,country,description,designation,points,price,province,taster_name,taster_twitter_handle,title,variety,...,year,contains_depth,contains_fruity,contains_herbal,review_length,region,review_length_chars,depth,fruity,herbal
107854,Portugal,This rare survival comes from Ramisco grapes p...,Reserva Velho,93,495.0,Colares,Roger Voss,@vossroger,Adega Viuva Gomes 1934 Reserva Velho Red (Cola...,Ramisco,...,1934,False,True,False,377,Colares,377,False,True,False


4. Which year of wines has the best score (points), on average?

---



In [None]:
# Extract the year from the 'title' column using the regular expression (\d{4})
df["year"] = df["title"].str.extract(r'(\d{4})').astype("Int64")  # Convert to integer type

# Calculate the average points per year
avg_points_by_year = df.groupby("year")["points"].mean().dropna().sort_values(ascending=False)

# Find the year with the highest average points
best_year = avg_points_by_year.idxmax()
best_avg_points = round(avg_points_by_year.max(), 2)

# Display the result
best_year, best_avg_points



(1969, 98.0)
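
The pattern `(\d{4})` captures the first run of four digits in a title, so a number that is part of a winery or wine name would be read as a vintage. A stricter extraction is sketched below using only the standard library. The helper name `extract_vintage`, the 1900 lower bound, and the second and third example titles are assumptions for illustration; the 2017 upper bound follows from the scrape date. Using it would change which rows get a year, so the result above would need to be recomputed.

```python
import re

# Assumed bounds: the reviews were scraped in 2017; 1900 is an arbitrary floor
MIN_VINTAGE, MAX_VINTAGE = 1900, 2017

def extract_vintage(title):
    """Return the first standalone 4-digit number within the plausible range, else None."""
    for match in re.findall(r"\b\d{4}\b", title):
        year = int(match)
        if MIN_VINTAGE <= year <= MAX_VINTAGE:
            return year
    return None

# Title taken from the first row of the dataset preview
print(extract_vintage("Nicosia 2013 Vulkà Bianco (Etna)"))

# Hypothetical title whose winery name starts with a number outside the range
print(extract_vintage("1848 Example Cellars 2012 Red (Somewhere)"))

# Hypothetical non-vintage title with no year at all
print(extract_vintage("Example Brut (Somewhere)"))
```

On the DataFrame this could be applied with something like `df["title"].dropna().map(extract_vintage)`, since the helper expects a string.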

5. Do reviews with the word "depth" in them tend to get better than average or worse than average points?

In [None]:
# Create a new column 'depth' that is True if 'depth' appears in the description, otherwise False
df["depth"] = df["description"].str.contains(r'\bdepth\b', case=False, na=False)

# Calculate the average points for reviews containing "depth"
avg_points_with_depth = round(df[df["depth"]]["points"].mean(), 2)

# Calculate the overall average points
overall_avg_points = round(df["points"].mean(), 2)

# Compare the two averages
avg_points_with_depth, overall_avg_points



(90.09, 88.45)

6. Do reviews with the word "fruity" in them tend to get better than average or worse than average points?

In [None]:
# Create a new column 'fruity' that is True if 'fruity' appears in the description, otherwise False
df["fruity"] = df["description"].str.contains(r'\bfruity\b', case=False, na=False)

# Calculate the average points for reviews containing "fruity"
avg_points_with_fruity = round(df[df["fruity"]]["points"].mean(), 2)

# Calculate the overall average points
overall_avg_points = round(df["points"].mean(), 2)

# Compare the two averages
avg_points_with_fruity, overall_avg_points


(87.6, 88.45)

7. Do reviews with the word "herbal" in them tend to get better than average or worse than average points?

In [None]:
# Create a new column 'herbal' that is True if 'herbal' appears in the description, otherwise False
df["herbal"] = df["description"].str.contains(r'\bherbal\b', case=False, na=False)

# Calculate the average points for reviews containing "herbal"
avg_points_with_herbal = round(df[df["herbal"]]["points"].mean(), 2)

# Calculate the overall average points
overall_avg_points = round(df["points"].mean(), 2)

# Compare the two averages
avg_points_with_herbal, overall_avg_points


(87.42, 88.45)
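
Questions 5 to 7 repeat the same steps with a different word each time. A small helper along the following lines could cover all three and also report how many reviews contain the word. The function name `keyword_point_gap`, its parameters, and the toy reviews are assumptions for illustration, not part of the original notebook.

```python
import re
import pandas as pd

def keyword_point_gap(df, word, text_col="description", score_col="points"):
    """Return (mean score of reviews containing word, overall mean, match count)."""
    # \b keeps the match to whole words; re.escape guards against regex metacharacters
    pattern = rf"\b{re.escape(word)}\b"
    has_word = df[text_col].str.contains(pattern, case=False, na=False)
    with_word = df.loc[has_word, score_col].mean()
    overall = df[score_col].mean()
    return round(with_word, 2), round(overall, 2), int(has_word.sum())

# Toy reviews, invented for illustration only
toy_reviews = pd.DataFrame({
    "description": [
        "Great depth and fruit",
        "Fruity and simple",
        "Herbal, fruity notes",
        None,
    ],
    "points": [92, 86, 85, 89],
})

for word in ["depth", "fruity", "herbal"]:
    print(word, keyword_point_gap(toy_reviews, word))
```

Because of the word boundaries, "fruit" in the first toy review does not count as a match for "fruity".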

8. Do longer reviews (more characters) award more or fewer points, on average?

In [None]:
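# Create the character-count column first; the steps below depend on it
df["review_length_chars"] = df["description"].str.len()

# Calculate the correlation between review length (characters) and points
correlation_chars_points = round(df["review_length_chars"].corr(df["points"]), 2)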

# Calculate the average points for longer vs. shorter reviews
median_length = df["review_length_chars"].median()
avg_points_long_reviews = round(df[df["review_length_chars"] > median_length]["points"].mean(), 2)
avg_points_short_reviews = round(df[df["review_length_chars"] <= median_length]["points"].mean(), 2)

# Display the results
correlation_chars_points, avg_points_long_reviews, avg_points_short_reviews


(0.56, 89.82, 87.09)

In [None]:
#Use string functions to create a new column that indicates the number of characters in a wine review (description).
#Then, determine the correlation between the length of a review in characters and points.
#What is the relationship between number of characters and points?


# Create a new column that stores the number of characters in each wine review
df["review_length_chars"] = df["description"].str.len()

# Calculate the correlation between review length (characters) and points
correlation_chars_points = df["review_length_chars"].corr(df["points"])

# Display the correlation result
round(correlation_chars_points,2)

0.56
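
By default, `Series.corr` computes the Pearson correlation coefficient, which measures how closely two variables follow a straight-line relationship on a scale from -1 to 1. The sketch below spells out that calculation with the standard library only. The function name `pearson_r` and the toy numbers are assumptions for illustration and are not drawn from the dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)

# Toy review lengths and scores, invented for illustration only
lengths = [120, 180, 240, 310, 400]
points = [85, 88, 87, 91, 93]
print(round(pearson_r(lengths, points), 2))
```

A coefficient of 0.56 indicates a moderate positive association between review length and points; it does not show that longer reviews cause higher scores.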

9. Which region of the province Sicily & Sardinia produces the best wine, on average?

In [None]:
# Extract region names from the 'title' column using single parentheses
df["region"] = df["title"].str.extract(r'\(([^)]+)\)')

# Filter data for the province "Sicily & Sardinia"
sicily_sardinia_wines = df[df["province"] == "Sicily & Sardinia"]

# Calculate the average points for each extracted region within Sicily & Sardinia
avg_points_by_region = sicily_sardinia_wines.groupby("region")["points"].mean().dropna().sort_values(ascending=False)

# Find the region with the highest average points
best_region = avg_points_by_region.idxmax()
best_avg_points = round(avg_points_by_region.max(), 2)

# Display the result
best_region, best_avg_points


('Faro', 94.0)

# Conclusion

---
*   England produces the highest-rated wines on average, with a score of 91.58 points.
*   Alexander Peartree gives the lowest scores on average (85.86 points).
*   Ramisco is the most expensive variety, with an average price of $495.00. This figure comes from the single Ramisco bottle in the dataset.
*   Wines from 1969 had the highest average rating, at 98.0 points.
*   Reviews containing "depth" tend to score higher than average (90.09 points vs. 88.45 points).
*   Reviews containing "fruity" (87.6 points) or "herbal" (87.42 points) tend to score lower than average (88.45 points).
*   Longer reviews tend to give higher scores (correlation: 0.56). Reviews above the median length average 89.82 points; shorter reviews average 87.09 points.
*   The Faro region produces the highest-rated wines in Sicily & Sardinia, with an average score of 94.0 points.

Final takeaway:
*   Longer reviews and reviews that mention "depth" tend to come with higher ratings than reviews that mention "fruity" or "herbal".
*   A high price does not guarantee the highest rating: the Ramisco bottle, at $495.00, scored 93 points.
*   Some tasters score more harshly than others, which can skew the average scores.
*   England, Faro (Sicily & Sardinia), and 1969 wines stand out by average score among the wines reviewed.
