## Exploratory Analysis

Exploratory analysis is one of the main tasks you perform with Pandas. Inspecting the data, deciding what is useful, and spotting what may be wrong with it so that you can clean it up are core practices for data scientists and data engineers.

## Create a Pandas DataFrame
Load a CSV file into a DataFrame to start exploring the data.

In [1]:
import pandas as pd
csv_url = "https://raw.githubusercontent.com/paiml/wine-ratings/main/wine-ratings.csv"
df = pd.read_csv(csv_url, index_col=0)
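
The `index_col=0` argument tells `read_csv` to use the first CSV column as the row index rather than as a data column. The sketch below shows the difference on a tiny in-memory CSV; `csv_text` and its two rows are made up for illustration and are not taken from the wine dataset.

```python
import io

import pandas as pd

# Hypothetical CSV whose first column is an unnamed row index.
csv_text = ",name,rating\n0,Wine A,91.0\n1,Wine B,89.0\n"

# With index_col=0 the first column becomes the index.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, pandas keeps the column and labels it "Unnamed: 0".
without_index = pd.read_csv(io.StringIO(csv_text))

print(list(with_index.columns))
print(list(without_index.columns))
```

Without `index_col=0`, the extra `Unnamed: 0` column would appear in every later operation.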

In [6]:
# Preview the first rows with .head(), usually the first thing to run on a new DataFrame
df.head(15)

,name,region,variety,rating,notes
0,1000 Stories Bourbon Barrel Aged Batch Blue Ca...,"Mendocino, California",Red Wine,91.0,"This is a very special, limited release of 100..."
1,1000 Stories Bourbon Barrel Aged Gold Rush Red...,California,Red Wine,89.0,The California Gold Rush was a period of coura...
2,1000 Stories Bourbon Barrel Aged Gold Rush Red...,California,Red Wine,90.0,The California Gold Rush was a period of coura...
3,1000 Stories Bourbon Barrel Aged Zinfandel 2013,"North Coast, California",Red Wine,91.0,"The wine has a deep, rich purple color. An int..."
4,1000 Stories Bourbon Barrel Aged Zinfandel 2014,California,Red Wine,90.0,Batch #004 is the first release of the 2014 vi...
5,1000 Stories Bourbon Barrel Aged Zinfandel 2016,California,Red Wine,91.0,"1,000 Stories Bourbon barrel-aged Zinfandel is..."
6,1000 Stories Bourbon Barrel Aged Zinfandel 2017,California,Red Wine,92.0,"Batch 55 embodies an opulent vintage, which sa..."
7,12 Linajes Crianza 2014,"Ribera del Duero, Spain",Red Wine,92.0,Red with violet hues. The aromas are very inte...
8,12 Linajes Reserva 2012,"Ribera del Duero, Spain",Red Wine,94.0,"On the nose, a complex predominance of mineral..."
9,14 Hands Cabernet Sauvignon 2010,"Columbia Valley, Washington",Red Wine,87.0,Concentrated aromas of dark stone fruits and t...


In [None]:

# Now let's get summary statistics for the numeric columns
df.describe()
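
By default `.describe()` summarizes only numeric columns. As a minimal sketch, passing `include="all"` also reports counts, unique values, and the most frequent value for text columns. The `sample` frame below is hypothetical stand-in data, not the wine dataset.

```python
import pandas as pd

# Hypothetical stand-in data with one text column and one numeric column.
sample = pd.DataFrame(
    {
        "variety": ["Red Wine", "Red Wine", "White Wine"],
        "rating": [91.0, 89.0, 90.0],
    }
)

# include="all" adds count/unique/top/freq rows for non-numeric columns.
summary = sample.describe(include="all")
print(summary)
```

Cells that do not apply to a column, such as the mean of a text column, show up as `NaN` in the result.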

In [None]:
# You can also get metadata about the dataset with .info()
df.info()
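
`.info()` lists the non-null count for each column. If you want the gaps as numbers you can work with, one possible follow-up is to count missing values and duplicate rows directly. This is a sketch on hypothetical data; the `sample` frame is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in data with one missing region, one missing rating,
# and one fully duplicated row.
sample = pd.DataFrame(
    {
        "name": ["Wine A", "Wine B", "Wine B", "Wine C", "Wine D"],
        "region": ["California", "Spain", "Spain", None, "Oregon"],
        "rating": [91.0, 89.0, 89.0, 90.0, None],
    }
)

# Missing values per column.
missing_per_column = sample.isna().sum()

# Rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())

print(missing_per_column)
print(f"duplicate rows: {duplicate_rows}")
```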

In [None]:
# Sort by rating, highest first, and show the top rows
df.sort_values(by="rating", ascending=False).head()
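
Many wines share the same rating, so the order within a tie may not be meaningful. A minimal sketch of one way to make it predictable is to sort on a second column as a tie-breaker. The `sample` frame is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in data with a tie at rating 92.
sample = pd.DataFrame(
    {
        "name": ["B wine", "A wine", "C wine"],
        "rating": [92.0, 92.0, 88.0],
    }
)

# Highest rating first; ties are ordered alphabetically by name.
ranked = sample.sort_values(by=["rating", "name"], ascending=[False, True])
print(ranked)
```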

Strip carriage returns and replace newlines with spaces:

In [None]:
df = df.replace({"\r": ""}, regex=True)
df = df.replace({"\n": " "}, regex=True)
df.head(10)
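
The cell above applies the replacement to every column. If only one text column needs cleaning, a column-level alternative is to chain `Series.str.replace`. This is a sketch on hypothetical data; the `notes` frame and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical tasting notes containing Windows and Unix line endings.
notes = pd.DataFrame(
    {"notes": ["Deep purple.\r\nRich finish.", "Bright\nand fresh."]}
)

# Drop carriage returns, then turn newlines into single spaces.
cleaned = (
    notes["notes"]
    .str.replace("\r", "", regex=False)
    .str.replace("\n", " ", regex=False)
)
print(cleaned.tolist())
```

Removing `\r` before replacing `\n` keeps a `\r\n` pair from producing two characters' worth of whitespace.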

In [3]:
# The grape column is not very useful, so let's drop it and describe the data again
df = df.drop(columns=["grape"])
df.describe()

,rating
count,32780.0
mean,91.186608
std,2.190391
min,85.0
25%,90.0
50%,91.0
75%,92.0
max,99.0
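
Dropping a column that has already been removed raises a `KeyError`, which is easy to hit when re-running notebook cells out of order. A minimal sketch of a re-runnable version uses `errors="ignore"`. The `wines` frame is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in data with an empty "grape" column.
wines = pd.DataFrame(
    {
        "name": ["Wine A", "Wine B"],
        "grape": [None, None],
        "rating": [91.0, 89.0],
    }
)

# errors="ignore" skips labels that are not present instead of raising.
wines = wines.drop(columns=["grape"], errors="ignore")

# Running the same drop again is now a harmless no-op.
wines = wines.drop(columns=["grape"], errors="ignore")
print(list(wines.columns))
```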


In [7]:
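# Aggregate per group with methods like .mean(); numeric_only=True skips the text columns,
# which recent pandas versions would otherwise refuse to average
df.groupby("region").mean(numeric_only=True)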

region,rating
"Abruzzo, Italy",89.954545
"Aconcagua Valley, Chile",90.633333
"Adelaida District, Paso Robles, Central Coast, California",92.357143
"Adelaide Hills, South Australia, Australia",89.625000
"Adelaide, South Australia, Australia",90.000000
...,...
"Yakima Valley, Columbia Valley, Washington",91.141304
"Yamhill-Carlton District, Willamette Valley, Oregon",91.404762
"Yarra Valley, Victoria, Australia",89.630769
"Yorkville Highlands, Mendocino, California",92.000000
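
A regional mean says little if it rests on only a handful of wines. One possible refinement, sketched here on hypothetical data, is to compute the mean and the number of rated wines together and then sort the result. The `sample` frame is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in data: two regions with different sample sizes.
sample = pd.DataFrame(
    {
        "region": ["California", "California", "Spain", "Spain", "Spain"],
        "rating": [90.0, 92.0, 94.0, 92.0, 93.0],
    }
)

# Mean rating and number of rated wines per region, best region first.
summary = (
    sample.groupby("region")["rating"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

Selecting the `rating` column before aggregating also avoids any attempt to average text columns.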
