Welcome back! Let's grab the data from part one, and get started. We'll need to import all our favorite packages first.

In [1]:
import pandas as pd 

%matplotlib inline
import matplotlib.pyplot as plt

import seaborn as sns



With Pandas, you can either read the CSV we made in the previous lab, or pass pd.read_csv the link to the <a href="https://raw.githubusercontent.com/Zipcoder/DataEngineering.Labs.WineQuality/master/Combined%20Wine%20Data.csv?token=ALOLXONPYPXMSJ6T4KTISB26BZHLG">raw content</a>.




In [2]:
data = pd.read_csv("https://raw.githubusercontent.com/Zipcoder/DataEngineering.Labs.WineQuality/\
master/Combined%20Wine%20Data.csv?token=ALOLXONPYPXMSJ6T4KTISB26BZHLG")

In [3]:
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red


Ok - let's learn how to manipulate a DataFrame. First off, it's kind of annoying that the color column is last. Let's put that right up front.

Note that there are lots of ways to rearrange or rename DataFrames. Below is a really simple, manual example to get you comfortable with how to access elements of dfs.

In [4]:
#let's get the current order of the columns
columns = data.columns.to_list()
columns

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'quality',
 'color']

In [5]:
#for simplicity let's first arrange the columns alphabetically, and then bring the 'color' column to the front
columns.sort()
columns

['alcohol',
 'chlorides',
 'citric acid',
 'color',
 'density',
 'fixed acidity',
 'free sulfur dioxide',
 'pH',
 'quality',
 'residual sugar',
 'sulphates',
 'total sulfur dioxide',
 'volatile acidity']

In [6]:
#pop() and insert() are both built-in list methods. Feel free to read up on them on your own.

columns.insert(0, columns.pop(columns.index('color')))

In [7]:
columns

['color',
 'alcohol',
 'chlorides',
 'citric acid',
 'density',
 'fixed acidity',
 'free sulfur dioxide',
 'pH',
 'quality',
 'residual sugar',
 'sulphates',
 'total sulfur dioxide',
 'volatile acidity']

In [8]:
rearranged_data = data[columns]
rearranged_data.head()

Unnamed: 0,color,alcohol,chlorides,citric acid,density,fixed acidity,free sulfur dioxide,pH,quality,residual sugar,sulphates,total sulfur dioxide,volatile acidity
0,red,9.4,0.076,0.0,0.9978,7.4,11.0,3.51,5,1.9,0.56,34.0,0.7
1,red,9.8,0.098,0.0,0.9968,7.8,25.0,3.2,5,2.6,0.68,67.0,0.88
2,red,9.8,0.092,0.04,0.997,7.8,15.0,3.26,5,2.3,0.65,54.0,0.76
3,red,9.8,0.075,0.56,0.998,11.2,17.0,3.16,6,1.9,0.58,60.0,0.28
4,red,9.4,0.076,0.0,0.9978,7.4,11.0,3.51,5,1.9,0.56,34.0,0.7


Let's think about the steps above. First we extracted a list of the columns, then we arranged the columns in the way we wanted them, then we constructed a new df using our custom ordering. If you think about this, this should give you some insight into how you can play with Pandas DataFrames.
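
The same three steps can be sketched more compactly. This is only an illustration on a tiny made-up DataFrame (the `toy` name and its values are assumptions, not the wine data), and it keeps the remaining columns in their existing order rather than sorting them alphabetically.

```python
import pandas as pd

# Toy stand-in for the wine data (values are made up for illustration)
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8],
    "quality": [5, 6],
    "color": ["red", "white"],
})

# Step 1 and 2: build the column order we want, with 'color' first
new_order = ["color"] + [c for c in toy.columns if c != "color"]

# Step 3: construct a new df using that ordering
toy_reordered = toy[new_order]

print(toy_reordered.columns.tolist())
```

Building the list with a comprehension avoids the pop/insert dance, but the manual version above is still worth understanding since it shows that the column order is just an ordinary Python list.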

Ok - let's move on.

In [9]:
#this brings up an important point - let's be clear about the difference between displaying a selection of a df, and keeping a new df.

data[columns].head()

#this looks rearranged, right? It is, but 'data' itself has not changed. The selection is built, displayed, and then discarded,
#because we never assigned it to a name. That's why we defined the 'rearranged_data' object above. By assigning the selection
#to a name, we keep a new object with our desired column order.

Unnamed: 0,color,alcohol,chlorides,citric acid,density,fixed acidity,free sulfur dioxide,pH,quality,residual sugar,sulphates,total sulfur dioxide,volatile acidity
0,red,9.4,0.076,0.0,0.9978,7.4,11.0,3.51,5,1.9,0.56,34.0,0.7
1,red,9.8,0.098,0.0,0.9968,7.8,25.0,3.2,5,2.6,0.68,67.0,0.88
2,red,9.8,0.092,0.04,0.997,7.8,15.0,3.26,5,2.3,0.65,54.0,0.76
3,red,9.8,0.075,0.56,0.998,11.2,17.0,3.16,6,1.9,0.58,60.0,0.28
4,red,9.4,0.076,0.0,0.9978,7.4,11.0,3.51,5,1.9,0.56,34.0,0.7


We can prove this by bringing up the original data:

In [10]:
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red


See? Still the same :) Let's keep chugging. Next we'll dive into other DataFrame operations. By the end, we'll build a function to score all of these wines for our personal tastes.
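
If you want to convince yourself of the "selecting is not the same as changing" point without the wine data, here is a minimal check on a made-up two-column frame (the `toy` name and values are assumptions for illustration).

```python
import pandas as pd

toy = pd.DataFrame({"alcohol": [9.4, 9.8], "color": ["red", "white"]})

# Selecting columns builds a new DataFrame; without assignment it is simply discarded
toy[["color", "alcohol"]]
original_order = toy.columns.tolist()

# Assigning the selection to a name is what keeps the reordered result around
toy_reordered = toy[["color", "alcohol"]]
reordered = toy_reordered.columns.tolist()

print(original_order, reordered)
```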

Let's get familiar with selecting slices of DataFrames. If you're a SQL wizard, some of this may feel familiar.

In [11]:
#Let's grab just rows that have high sugar content. How high? Let's say, higher than average. 

#What is the average?

rearranged_data['residual sugar'].mean()

5.4432353393874156

In [12]:
#now create a new df of high sweetness wines
high_sugar_wines = rearranged_data[rearranged_data['residual sugar'] > rearranged_data['residual sugar'].mean()]

#I've got a hunch that this sweetness characteristic is probably highly related to the color of the wine. 

In [13]:
#let's see if there's a big difference in reds vs. whites.

high_sugar_wines.color.value_counts()

white    2383
red        74
Name: color, dtype: int64

Looks like nearly all the really sweet wines are white.
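
One caveat: raw counts also depend on how many wines of each color are in the data to begin with, so it can help to look at the share of each color that is above the mean instead. The sketch below shows one way to do that on made-up rows (the `toy` frame and its numbers are assumptions, not results from the wine data).

```python
import pandas as pd

# Made-up rows standing in for the wine data
toy = pd.DataFrame({
    "color": ["red", "red", "white", "white", "white", "white"],
    "residual sugar": [1.9, 2.6, 1.2, 8.1, 12.0, 7.5],
})

# Boolean Series: True where a wine is sweeter than the overall mean
is_sweet = toy["residual sugar"] > toy["residual sugar"].mean()

# The mean of a boolean Series is the fraction of True values,
# so grouping by color gives the share of sweet wines per color
sweet_share = is_sweet.groupby(toy["color"]).mean()

print(sweet_share)
```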

Well, turns out I don't like whites at all. So let's revisit the selection we made. Now let's select all the sweet wines, but let's exclude all the white ones.

In [14]:
high_sugar_wines = high_sugar_wines[high_sugar_wines.color != 'white']

In [15]:
high_sugar_wines.tail()

Unnamed: 0,color,alcohol,chlorides,citric acid,density,fixed acidity,free sulfur dioxide,pH,quality,residual sugar,sulphates,total sulfur dioxide,volatile acidity
1476,red,8.8,0.205,0.5,1.00242,9.9,48.0,3.16,5,13.8,0.75,82.0,0.5
1478,red,10.2,0.082,0.05,0.99808,7.1,3.0,3.4,3,5.7,0.52,14.0,0.875
1558,red,9.5,0.235,0.33,0.99787,6.9,66.0,3.22,5,6.7,0.56,115.0,0.63
1574,red,10.5,0.074,0.78,0.99677,5.6,23.0,3.39,6,13.9,0.48,92.0,0.31
1589,red,9.2,0.073,0.2,0.9977,6.6,29.0,3.29,5,7.8,0.54,79.0,0.725


Now you can see that we've chopped up the data and selected just the rows we want.
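
The two filters above can also be combined into a single selection. Here is a small sketch on made-up rows (the `toy` frame, `sweet`, and `not_white` are illustrative names, not part of the lab) that builds each condition as a boolean mask and joins them with `&`.

```python
import pandas as pd

toy = pd.DataFrame({
    "color": ["red", "red", "white", "white"],
    "residual sugar": [1.9, 9.0, 12.0, 1.1],
})

# Each comparison produces a boolean Series with one True/False per row
sweet = toy["residual sugar"] > toy["residual sugar"].mean()
not_white = toy["color"] != "white"

# '&' combines the masks element-wise; only rows where both are True are kept
sweet_reds = toy[sweet & not_white]

print(sweet_reds)
```

If you write the conditions inline instead of naming them first, wrap each one in parentheses, since `&` binds more tightly than `>` and `!=` in Python.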