# Introduction

In the last tutorial, we learned how to select relevant data out of a DataFrame or Series. Plucking the right data out of our data representation is critical to getting work done, as we demonstrated in the exercises.

However, the data does not always come out of memory in the format we want right off the bat. Sometimes we have to do some more work ourselves to reformat it for the task at hand. This tutorial covers operations we can apply to our data to get the input "just right."

We'll use the Wine Magazine data for demonstration.

In [21]:
import numpy as np
import pandas as pd

pd.options.display.max_rows = 10
reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)

# Summary functions

Pandas provides many simple "summary functions" which restructure the data in some useful way. For example, consider the `describe()` method:

In [22]:
reviews['points'].describe()

count    129971.000000
mean         88.447138
std           3.039730
min          80.000000
25%          86.000000
50%          88.000000
75%          91.000000
max         100.000000
Name: points, dtype: float64

This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the data type of the input. The output above only makes sense for numerical data; for string data, here is what we get:

In [23]:
reviews['taster_name'].describe()

count         103727
unique            19
top       Roger Voss
freq           25514
Name: taster_name, dtype: object
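
To make the type-aware behavior concrete, here is a minimal sketch on a made-up four-row table (the `toy` frame and its values are hypothetical, not taken from the wine data). It calls `describe()` on a numeric column and on a string column and prints both summaries:

```python
import pandas as pd

# Hypothetical stand-in for the reviews table; the values are made up.
toy = pd.DataFrame({
    "points": [87, 90, 90, 95],
    "taster_name": ["Ann", "Bob", "Ann", None],
})

# Numeric column: count, mean, std, min, quartiles, max
numeric_summary = toy["points"].describe()

# String column: count, unique, top (most frequent value), freq (its count)
string_summary = toy["taster_name"].describe()

print(numeric_summary)
print(string_summary)
```

The numeric summary is indexed by `count`, `mean`, `std`, `min`, the three quartiles, and `max`, while the string summary is indexed by `count`, `unique`, `top`, and `freq`. Missing values are left out of `count` in both cases.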

If you want a particular summary statistic for a DataFrame column or a Series, there is usually a pandas method that computes it.

For example, to see the mean of the points allotted (i.e. how well an averagely rated wine does), we can use the `mean()` method:

In [24]:
reviews['points'].mean()

np.float64(88.44713820775404)

To see a list of unique values, we can use the `unique()` method:

In [25]:
reviews['taster_name'].unique()

array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
       'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
       'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
       'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
       'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
       'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
       'Christina Pickard'], dtype=object)
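
Note that the array above has 20 entries because `nan` is listed as a value, whereas `describe()` reported 19 unique tasters. The following sketch, on a small hypothetical Series (the names are made up), shows how `unique()` and `nunique()` differ in their treatment of missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical taster names with one missing entry.
tasters = pd.Series(["Ann", "Bob", np.nan, "Ann"], name="taster_name")

# unique() returns the distinct values and keeps the missing value as an entry.
distinct_values = tasters.unique()

# nunique() counts distinct values and skips missing values by default.
n_distinct = tasters.nunique()
n_distinct_with_nan = tasters.nunique(dropna=False)

print(len(distinct_values), n_distinct, n_distinct_with_nan)
```

If only the number of distinct values matters, `nunique()` avoids building the full list first.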

To see a list of unique values _and_ how often they occur in the dataset, we can use the `value_counts()` method:

In [26]:
reviews['taster_name'].value_counts()

taster_name
Roger Voss            25514
Michael Schachner     15134
Kerin O’Keefe         10776
Virginie Boone         9537
Paul Gregutt           9532
                      ...  
Jeff Jenssen            491
Alexander Peartree      415
Carrie Dykes            139
Fiona Adams              27
Christina Pickard         6
Name: count, Length: 19, dtype: int64
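
By default `value_counts()` leaves missing values out and sorts by frequency, which is why 19 tasters are listed here. As a rough sketch on a hypothetical Series (made-up names), the `normalize` and `dropna` parameters change what is reported:

```python
import numpy as np
import pandas as pd

# Hypothetical taster names with one missing entry.
tasters = pd.Series(["Ann", "Bob", np.nan, "Ann"], name="taster_name")

# Default: missing values excluded, most frequent value first.
counts = tasters.value_counts()

# normalize=True reports each value's share of the non-missing entries.
shares = tasters.value_counts(normalize=True)

# dropna=False adds a row for the missing values.
counts_with_nan = tasters.value_counts(dropna=False)

print(counts)
print(shares)
print(counts_with_nan)
```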

# Expanding on Useful Functions

Here are some additional methods that help summarize or transform data:

### `min()` and `max()`
Find the minimum and maximum values in a column:

In [27]:
reviews['points'].min()

np.int64(80)

In [28]:
reviews['points'].max()

np.int64(100)

### `isnull()` and `notnull()`
Check for missing values:

In [29]:
reviews['price'].isnull().sum()

np.int64(8996)

In [30]:
reviews['price'].notnull().sum()

np.int64(120975)
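
The two counts add up to the number of rows (8996 + 120975 = 129971), because every entry is either missing or not. Since `isnull()` and `notnull()` return boolean Series, they can also be averaged or used as filters. A small sketch, assuming a hypothetical `prices` Series with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical prices with two missing entries.
prices = pd.Series([15.0, np.nan, 14.0, np.nan, 65.0], name="price")

missing_mask = prices.isnull()        # True where the price is missing
n_missing = missing_mask.sum()        # True counts as 1 when summed
share_missing = missing_mask.mean()   # fraction of entries that are missing

# A boolean mask can select only the rows that have a price.
known_prices = prices[prices.notnull()]

print(n_missing, share_missing, len(known_prices))
```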

### `sum()`
Calculate the sum of values in a column:

In [31]:
reviews['points'].sum()

np.int64(11495563)

### `count()`
Count the number of non-null entries in a column:

In [32]:
reviews['points'].count()

np.int64(129971)
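
Here `count()` equals the total number of rows because `points` has no missing entries. For a column with gaps the two numbers differ, as this sketch on a hypothetical three-row frame (made-up values) suggests:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "price" has one missing entry, "points" has none.
toy = pd.DataFrame({"points": [87, 90, 95], "price": [15.0, np.nan, 65.0]})

total_rows = len(toy)               # counts every row, missing or not
non_null_per_column = toy.count()   # non-null entries in each column

print(total_rows)
print(non_null_per_column)
```

Calling `count()` on a whole DataFrame is a quick way to spot which columns have missing data.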

### `apply()`
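Apply a custom function to every row by passing `axis='columns'`. Note that `apply()` returns a new, transformed DataFrame; it does not modify `reviews` itself: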

In [33]:
def transform_points(row):
    # Add 5 to this row's score and return the modified row
    row['points'] = row['points'] + 5
    return row

reviews.apply(transform_points, axis='columns').head()

| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 92 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 92 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| 2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 92 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
| 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 92 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
| 4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 92 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
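
Row-wise `apply()` calls a Python function once per row, which can be slow on a table with over a hundred thousand rows. For a simple adjustment like adding 5, plain column arithmetic gives the same numbers. The sketch below compares the two on a hypothetical three-row frame (the `toy` data is made up):

```python
import pandas as pd

# Hypothetical stand-in for the reviews table; the values are made up.
toy = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "points": [87, 90, 95],
})

def transform_points(row):
    # Add 5 to this row's score and return the modified row
    row["points"] = row["points"] + 5
    return row

# Row-wise: the function runs once for every row.
via_apply = toy.apply(transform_points, axis="columns")

# Vectorized: one arithmetic operation on the whole column.
via_arithmetic = toy["points"] + 5

print(via_apply["points"].tolist())
print(via_arithmetic.tolist())
print(toy["points"].tolist())  # the original frame is unchanged in this sketch
```

Both approaches produce new objects; to keep the result, assign it to a variable or back to a column.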


Together, these methods cover many of the everyday tasks of inspecting, cleaning, summarizing, and transforming your data.