# Practice 1

## Instructions

You're going to practice working in pandas. 


You'll walk through instantiating a `DataFrame`, reading data into it, examining that data, and then playing with it. 


We'll use a dataset on the [quality of red wines](https://archive.ics.uci.edu/ml/datasets/wine+quality) for this purpose.
It is located in the `data` folder within this directory and is called `winequality-red.csv`. 


Typically, we use Jupyter notebooks like this for a very specific set of things: presentations and exploratory data analysis (EDA). 


Because we'll be playing around with `pandas` today, much of what we'll be doing counts as EDA. Working in a notebook gives us a tighter feedback loop than writing a script would. In general, though, **we do not use Jupyter notebooks for development**. 

Below is a set of questions, each followed by a cell where you can work on your answer. Feel free to add more cells if you'd like; it often makes sense to use more than one cell for an answer. 

## Assignment Questions 

### Part 1 - The Basics of DataFrames

Let's start off by following the general workflow that we use when moving data into a DataFrame: 

* Importing pandas
* Reading data into the DataFrame
* Getting a general sense of the data

For this part, work through the following steps:


1. Import pandas

In [1]:
import pandas as pd

2. Read the wine data into a DataFrame. 

In [6]:
r_wine_df = pd.read_csv("data/winequality-red.csv", delimiter=";")
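
The call above passes a semicolon as the delimiter because the wine file is not comma-separated. The following is a minimal sketch of what goes wrong without it, using a hypothetical two-row sample (the name `sample_csv` and its values are made up for illustration, and the header is simplified). Note that `delimiter` is an alias for the `sep` parameter of `read_csv`.

```python
import io

import pandas as pd

# Hypothetical two-row sample in a semicolon-separated layout (illustrative values).
sample_csv = "fixed acidity;pH;quality\n7.4;3.51;5\n7.8;3.20;5\n"

# With the default comma separator, each line is read as one single column.
wrong_df = pd.read_csv(io.StringIO(sample_csv))

# Passing sep=";" splits every line into its three fields.
right_df = pd.read_csv(io.StringIO(sample_csv), sep=";")

print(wrong_df.shape)   # (2, 1)
print(right_df.shape)   # (2, 3)
```

Checking `.shape` right after reading is a quick way to catch a wrong separator.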

3. Use the `attributes` and `methods` available on DataFrames to answer the following questions: 
    * How many rows and columns are in the DataFrame?
        - 1599 rows and 12 columns
    * What data type is in each column?
        - float type for columns 0-10, int type for column 11
    * Are all of the variables continuous, or are any categorical?
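        - `quality` (col 11) is stored as an integer but is best treated as categorical (ordinal): it only takes whole-number scores, 3 through 8 in this data. The other columns are continuous.
    * How many non-null values are in each column?
        - 1599 in each
    * What are the min, mean, max, median for all numeric columns?
        - see the result of `describe()` (the 50% row is the median)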

In [8]:
r_wine_df.info()
r_wine_df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |
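
The questions in this part can also be answered one at a time with individual attributes and methods. Here is a small sketch on a hypothetical stand-in frame (`toy_df` and its values are made up for illustration, including one missing value so the non-null count differs from the row count).

```python
import pandas as pd

# Hypothetical stand-in data; values are illustrative, not taken from the wine file.
toy_df = pd.DataFrame({
    "alcohol": [9.4, 9.8, None, 10.0],
    "quality": [5, 5, 6, 7],
})

n_rows, n_cols = toy_df.shape    # attribute: (number of rows, number of columns)
dtypes = toy_df.dtypes           # attribute: one dtype per column
non_null = toy_df.count()        # method: non-null values per column
medians = toy_df.median()        # method: matches the 50% row of describe()

print(n_rows, n_cols)            # 4 2
print(non_null["alcohol"])       # 3
print(dtypes["quality"])
print(medians["alcohol"], toy_df.describe().loc["50%", "alcohol"])
```

Attributes such as `shape` and `dtypes` are read without parentheses, while methods such as `count()` and `median()` are called.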


### Part 2 - Practice with Grabbing Data

Let's now get some practice with grabbing certain parts of the data. If you'd like some extra practice, try answering each of the questions in more than one way (because remember, we can often grab our data in a couple of different ways). 

1. Grab the first 10 rows of the `chlorides` column. 


In [26]:
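# .loc slices by label and includes both endpoints, so :10 returns labels 0-10 (11 rows); use :9 for the first 10.
print(r_wine_df.loc[:10, "chlorides"])
# .iloc slices by position and excludes the stop, so :10 returns the first 10 rows.
print(r_wine_df.iloc[:10, 4])
# For the same reason, :9 returns only 9 rows (positions 0-8).
print(r_wine_df.iloc[:9, 4])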

0     0.076
1     0.098
2     0.092
3     0.075
4     0.076
5     0.075
6     0.069
7     0.065
8     0.073
9     0.071
10    0.097
Name: chlorides, dtype: float64
0    0.076
1    0.098
2    0.092
3    0.075
4    0.076
5    0.075
6    0.069
7    0.065
8    0.073
9    0.071
Name: chlorides, dtype: float64
0    0.076
1    0.098
2    0.092
3    0.075
4    0.076
5    0.075
6    0.069
7    0.065
8    0.073
Name: chlorides, dtype: float64
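
The three outputs above differ in length because label-based and position-based slices treat the stop value differently. A minimal sketch on a hypothetical 12-row frame (`toy_df` is made up for illustration) shows three ways to get exactly the first 10 rows:

```python
import pandas as pd

# Hypothetical frame with a default RangeIndex (labels 0..11), like the wine data.
toy_df = pd.DataFrame({"chlorides": [round(0.05 + 0.001 * i, 3) for i in range(12)]})

by_label = toy_df.loc[:9, "chlorides"]      # label slice: stop label 9 is included
by_position = toy_df.iloc[:10, 0]           # position slice: stop position 10 is excluded
with_head = toy_df["chlorides"].head(10)    # head(n) returns the first n rows

print(len(by_label), len(by_position), len(with_head))   # 10 10 10
print(len(toy_df.loc[:10, "chlorides"]))                 # 11
```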


2. Grab the last 10 rows of the `chlorides` column. 


In [27]:
print(r_wine_df.tail(10)["chlorides"])

1589    0.073
1590    0.077
1591    0.089
1592    0.076
1593    0.068
1594    0.090
1595    0.062
1596    0.076
1597    0.075
1598    0.067
Name: chlorides, dtype: float64


3. Grab indices 264-282 of the `chlorides` **and** `density` columns. 


In [25]:
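# .loc is label-based and includes both endpoints, so this returns indices 264-282:
#print(r_wine_df.loc[264:282, ['chlorides', 'density']])
# .iloc excludes the stop position, so this returns 264-281; use 264:283 to include 282.
print(r_wine_df.iloc[264:282, [4, 7]])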

     chlorides  density
264      0.064  0.99990
265      0.071  0.99680
266      0.096  1.00025
267      0.078  0.99730
268      0.077  0.99870
269      0.104  0.99960
270      0.087  0.99650
271      0.104  0.99960
272      0.071  0.99935
273      0.076  0.99735
274      0.088  0.99915
275      0.087  0.99650
276      0.077  0.99870
277      0.104  0.99960
278      0.073  0.99760
279      0.087  0.99910
280      0.071  0.99860
281      0.358  0.99720
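
When selecting a row range together with several columns, the label-based and position-based forms need different stop values to return the same rows. The sketch below uses a hypothetical five-row frame (`toy_df` is made up for illustration) and looks up column positions with `Index.get_indexer` rather than hard-coding them.

```python
import pandas as pd

# Hypothetical stand-in data; values are illustrative.
toy_df = pd.DataFrame({
    "chlorides": [0.07, 0.09, 0.08, 0.06, 0.10],
    "pH": [3.5, 3.2, 3.3, 3.1, 3.4],
    "density": [0.997, 0.996, 0.998, 0.995, 0.999],
})

# Label-based: rows 1 through 3 inclusive, columns selected by name.
by_label = toy_df.loc[1:3, ["chlorides", "density"]]

# Position-based: the stop must be one past the last row wanted.
col_positions = toy_df.columns.get_indexer(["chlorides", "density"])
by_position = toy_df.iloc[1:4, col_positions]

print(by_label.equals(by_position))   # True
```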


4. Grab all rows where the `chlorides` value is less than 0.10. 


In [29]:
r_wine_df.query("chlorides < 0.10")

fixed acidity            15.90000
volatile acidity          1.33000
citric acid               0.78000
residual sugar           15.50000
chlorides                 0.09900
free sulfur dioxide      72.00000
total sulfur dioxide    289.00000
density                   1.00315
pH                        4.01000
sulphates                 1.62000
alcohol                  14.90000
quality                   8.00000
dtype: float64
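
As an alternative way to answer question 4, the same filter can be written with a boolean mask. This is a sketch on a hypothetical four-row frame (`toy_df` is made up for illustration); both forms return whole rows of the DataFrame.

```python
import pandas as pd

# Hypothetical stand-in data; values are illustrative.
toy_df = pd.DataFrame({
    "chlorides": [0.076, 0.098, 0.114, 0.100],
    "quality": [5, 5, 6, 7],
})

mask = toy_df["chlorides"] < 0.10             # boolean Series, one entry per row
masked = toy_df[mask]                         # keep only rows where the mask is True
queried = toy_df.query("chlorides < 0.10")    # same filter written as a query string

print(list(masked.index))        # [0, 1]
print(masked.equals(queried))    # True
```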

5. Now grab all the rows where the `chlorides` value is greater than the column's mean (try **not** to use a hard-coded value for the mean, but instead a method).

In [33]:
r_wine_df.query("chlorides > chlorides.mean()")

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 |
| 10 | 6.7 | 0.580 | 0.08 | 1.8 | 0.097 | 15.0 | 65.0 | 0.99590 | 3.28 | 0.54 | 9.2 | 5 |
| 12 | 5.6 | 0.615 | 0.00 | 1.6 | 0.089 | 16.0 | 59.0 | 0.99430 | 3.58 | 0.52 | 9.9 | 5 |
| 13 | 7.8 | 0.610 | 0.29 | 1.6 | 0.114 | 9.0 | 29.0 | 0.99740 | 3.26 | 1.56 | 9.1 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1558 | 6.9 | 0.630 | 0.33 | 6.7 | 0.235 | 66.0 | 115.0 | 0.99787 | 3.22 | 0.56 | 9.5 | 5 |
| 1570 | 6.4 | 0.360 | 0.53 | 2.2 | 0.230 | 19.0 | 35.0 | 0.99340 | 3.37 | 0.93 | 12.4 | 6 |
| 1578 | 6.8 | 0.670 | 0.15 | 1.8 | 0.118 | 13.0 | 20.0 | 0.99540 | 3.42 | 0.67 | 11.3 | 6 |
| 1591 | 5.4 | 0.740 | 0.09 | 1.7 | 0.089 | 16.0 | 26.0 | 0.99402 | 3.67 | 0.56 | 11.6 | 6 |
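
Question 5 can be answered in more than one way. The sketch below compares three of them on a hypothetical frame (`toy_df` and `mean_chlorides` are made up for illustration): calling the method inside the query string, referencing a Python variable with `@`, and using a boolean mask.

```python
import pandas as pd

# Hypothetical stand-in data; values are illustrative.
toy_df = pd.DataFrame({"chlorides": [0.05, 0.07, 0.09, 0.23]})

# 1. Compute the mean inside the query string.
in_query = toy_df.query("chlorides > chlorides.mean()")

# 2. Compute the mean first, then reference the variable with @.
mean_chlorides = toy_df["chlorides"].mean()
with_variable = toy_df.query("chlorides > @mean_chlorides")

# 3. Boolean mask, no query string involved.
with_mask = toy_df[toy_df["chlorides"] > toy_df["chlorides"].mean()]

print(list(with_mask.index))   # [3]
```

None of the three hard-codes the mean, so each keeps working if the data changes.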


6. Grab all those rows where the `pH` is greater than 3.0 and less than 3.5. 

In [36]:
r_wine_df.query("3.0 < pH < 3.5")

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 |
| 6 | 7.9 | 0.60 | 0.06 | 1.6 | 0.069 | 15.0 | 59.0 | 0.99640 | 3.30 | 0.46 | 9.4 | 5 |
| 7 | 7.3 | 0.65 | 0.00 | 1.2 | 0.065 | 15.0 | 21.0 | 0.99460 | 3.39 | 0.47 | 10.0 | 7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1592 | 6.3 | 0.51 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
| 1593 | 6.8 | 0.62 | 0.08 | 1.9 | 0.068 | 28.0 | 38.0 | 0.99651 | 3.42 | 0.82 | 9.5 | 6 |
| 1594 | 6.2 | 0.60 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 |
| 1596 | 6.3 | 0.51 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
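
The range filter in question 6 can also be built from two boolean masks joined with `&`. A small sketch on hypothetical values (`toy_df` is made up for illustration); the parentheses are needed because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical stand-in data chosen to include both boundary values.
toy_df = pd.DataFrame({"pH": [2.9, 3.0, 3.2, 3.5, 3.6]})

# Both comparisons are strict, so 3.0 and 3.5 themselves are excluded.
in_range = (toy_df["pH"] > 3.0) & (toy_df["pH"] < 3.5)
masked = toy_df[in_range]

# The chained comparison in a query string expresses the same condition.
queried = toy_df.query("3.0 < pH < 3.5")

print(list(masked.index))       # [2]
print(masked.equals(queried))   # True
```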



7. Further filter the results from 6 to grab only those rows that have a `residual sugar` less than 2.0. 

In [42]:
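# A bare column name containing a space cannot be written in query(), so rename the column first.
r_wine_df.rename(columns={'residual sugar': 'residual_sugar'}).query("3.0 < pH < 3.5 and residual_sugar < 2.0")
# Without the rename, the line below fails because residual_sugar is not a column:
#r_wine_df.query("3.0 < pH < 3.5 and residual_sugar < 2.0")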

| | fixed acidity | volatile acidity | citric acid | residual_sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 |
| 6 | 7.9 | 0.60 | 0.06 | 1.6 | 0.069 | 15.0 | 59.0 | 0.99640 | 3.30 | 0.46 | 9.4 | 5 |
| 7 | 7.3 | 0.65 | 0.00 | 1.2 | 0.065 | 15.0 | 21.0 | 0.99460 | 3.39 | 0.47 | 10.0 | 7 |
| 10 | 6.7 | 0.58 | 0.08 | 1.8 | 0.097 | 15.0 | 65.0 | 0.99590 | 3.28 | 0.54 | 9.2 | 5 |
| 13 | 7.8 | 0.61 | 0.29 | 1.6 | 0.114 | 9.0 | 29.0 | 0.99740 | 3.26 | 1.56 | 9.1 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1569 | 6.2 | 0.51 | 0.14 | 1.9 | 0.056 | 15.0 | 34.0 | 0.99396 | 3.48 | 0.57 | 11.5 | 6 |
| 1576 | 8.0 | 0.30 | 0.63 | 1.6 | 0.081 | 16.0 | 29.0 | 0.99588 | 3.30 | 0.78 | 10.8 | 6 |
| 1578 | 6.8 | 0.67 | 0.15 | 1.8 | 0.118 | 13.0 | 20.0 | 0.99540 | 3.42 | 0.67 | 11.3 | 6 |
| 1590 | 6.3 | 0.55 | 0.15 | 1.8 | 0.077 | 26.0 | 35.0 | 0.99314 | 3.32 | 0.82 | 11.6 | 6 |


In [44]:
r_wine_df.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')
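
Several of the column names listed above contain spaces. As an alternative to renaming, `query` accepts backtick-quoted column names (available in pandas 0.25 and later, which is an assumption about the installed version). A minimal sketch on a hypothetical frame (`toy_df` is made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in data; values are illustrative.
toy_df = pd.DataFrame({
    "pH": [3.1, 3.2, 3.6],
    "residual sugar": [1.8, 2.4, 1.5],
})

# Backticks let query() refer to a column whose name contains a space.
filtered = toy_df.query("3.0 < pH < 3.5 and `residual sugar` < 2.0")

print(list(filtered.index))     # [0]
print(list(filtered.columns))   # ['pH', 'residual sugar']
```

The original column names are left unchanged, which avoids carrying renamed columns into later cells.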

### Part 3 - More Practice

Let's move on to some more complicated things. Use your knowledge of `groupby` and sorting to answer the following. 

1. Get the average amount of `chlorides` for each `quality` value.

In [46]:
r_wine_df.groupby("quality").chlorides.mean()

quality
3    0.122500
4    0.090679
5    0.092736
6    0.084956
7    0.076588
8    0.068444
Name: chlorides, dtype: float64
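
The group-then-aggregate pattern used above can be sketched on a hypothetical frame small enough to check by hand (`toy_df` and its values are made up for illustration). Selecting the column with brackets is equivalent to the attribute form and also works for names that contain spaces.

```python
import pandas as pd

# Hypothetical stand-in data; values are illustrative.
toy_df = pd.DataFrame({
    "quality": [5, 5, 6, 6, 7],
    "chlorides": [0.08, 0.10, 0.07, 0.09, 0.06],
})

# One mean per distinct quality value; the result is indexed by quality.
mean_by_quality = toy_df.groupby("quality")["chlorides"].mean()

# agg() can compute several statistics per group in one pass.
summary = toy_df.groupby("quality")["chlorides"].agg(["mean", "count"])

print(mean_by_quality)
print(summary)
```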

2. For observations with a `pH` greater than 3.0 and less than 4.0, find the average `alcohol` value by `pH`. 

In [49]:
r_wine_df.query('3.0 < pH < 4.0').groupby('pH').alcohol.mean()

pH
3.01    11.320000
3.02    10.200000
3.03     9.633333
3.04     9.740000
3.05    10.050000
          ...    
3.74    11.500000
3.75    10.500000
3.78    12.400000
3.85    12.900000
3.90    12.950000
Name: alcohol, Length: 75, dtype: float64

3. For observations with an `alcohol` value between 9.25 and 9.5, find the highest amount of `residual sugar`. 

In [58]:
r_wine_df.rename(columns={"residual sugar": "residual_sugar"}, inplace=True)
r_wine_df.query("9.25 < alcohol < 9.5").residual_sugar.max()

10.7

4. Create a new column, called `total_acidity`, that is the sum of `fixed acidity` and `volatile acidity`. 

In [60]:
# Rename both columns in one call so they can be used as bare names in eval().
r_wine_df.rename(columns={"fixed acidity": "fixed_acidity", "volatile acidity": "volatile_acidity"}, inplace=True)
r_wine_df.eval("total_acidity = fixed_acidity + volatile_acidity", inplace=True)

In [61]:
r_wine_df.columns

Index(['fixed_acidity', 'volatile_acidity', 'citric acid', 'residual_sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality', 'total_acidity'],
      dtype='object')
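
A derived column such as `total_acidity` can also be created without renaming anything. The sketch below shows two common alternatives on a hypothetical two-row frame (`toy_df` is made up for illustration): direct assignment, which modifies the frame, and `assign`, which returns a new one.

```python
import pandas as pd

# Hypothetical stand-in data; values are illustrative.
toy_df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8],
    "volatile acidity": [0.70, 0.88],
})

# assign() returns a new DataFrame and leaves toy_df untouched.
with_assign = toy_df.assign(
    total_acidity=lambda d: d["fixed acidity"] + d["volatile acidity"]
)

# Direct assignment adds the column to toy_df itself.
toy_df["total_acidity"] = toy_df["fixed acidity"] + toy_df["volatile acidity"]

print(list(toy_df.columns))
print(with_assign.equals(toy_df))   # True
```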

5. Find the average `total_acidity` for each of the `quality` values. 

In [62]:
r_wine_df.groupby('quality').total_acidity.mean()

quality
3    9.244500
4    8.473208
5    8.744295
6    8.844663
7    9.276281
8    8.990000
Name: total_acidity, dtype: float64

6. Find the top 5 `density` values. 

In [63]:
r_wine_df.sort_values("density", ascending=False)[:5]

| | fixed_acidity | volatile_acidity | citric acid | residual_sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | total_acidity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1434 | 10.2 | 0.54 | 0.37 | 15.4 | 0.214 | 55.0 | 95.0 | 1.00369 | 3.18 | 0.77 | 9.0 | 6 | 10.74 |
| 1435 | 10.2 | 0.54 | 0.37 | 15.4 | 0.214 | 55.0 | 95.0 | 1.00369 | 3.18 | 0.77 | 9.0 | 6 | 10.74 |
| 442 | 15.6 | 0.685 | 0.76 | 3.7 | 0.1 | 6.0 | 43.0 | 1.0032 | 2.95 | 0.68 | 11.2 | 7 | 16.285 |
| 554 | 15.5 | 0.645 | 0.49 | 4.2 | 0.095 | 10.0 | 23.0 | 1.00315 | 2.92 | 0.74 | 11.1 | 5 | 16.145 |
| 555 | 15.5 | 0.645 | 0.49 | 4.2 | 0.095 | 10.0 | 23.0 | 1.00315 | 2.92 | 0.74 | 11.1 | 5 | 16.145 |
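
The sort-and-slice approach above returns whole rows. If only the values are wanted, `nlargest` is another option. This is a sketch on a hypothetical column (`toy_df` is made up for illustration) showing the Series and DataFrame forms side by side.

```python
import pandas as pd

# Hypothetical stand-in data with one repeated top value.
toy_df = pd.DataFrame({"density": [0.9968, 1.0037, 0.9970, 1.0032, 0.9956, 1.0037]})

# Series.nlargest returns just the values, largest first, with their index labels.
top3_values = toy_df["density"].nlargest(3)

# DataFrame.nlargest returns the full rows, ordered by the named column.
top3_rows = toy_df.nlargest(3, "density")

print(top3_values.tolist())   # [1.0037, 1.0037, 1.0032]
print(len(top3_rows))         # 3
```

Duplicated values count separately here, just as the duplicated rows do in the output above.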


7. Find the 10 lowest `sulphates` values. 

In [64]:
r_wine_df.sort_values("sulphates")[:10]

| | fixed_acidity | volatile_acidity | citric acid | residual_sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | total_acidity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 170 | 7.9 | 0.885 | 0.03 | 1.8 | 0.058 | 4.0 | 8.0 | 0.9972 | 3.36 | 0.33 | 9.1 | 4 | 8.785 |
| 1369 | 6.6 | 0.61 | 0.0 | 1.6 | 0.069 | 4.0 | 8.0 | 0.99396 | 3.33 | 0.37 | 10.4 | 4 | 7.21 |
| 1287 | 8.0 | 0.6 | 0.08 | 2.6 | 0.056 | 3.0 | 7.0 | 0.99286 | 3.22 | 0.37 | 13.0 | 5 | 8.6 |
| 1347 | 7.2 | 0.655 | 0.03 | 1.8 | 0.078 | 7.0 | 12.0 | 0.99587 | 3.34 | 0.39 | 9.5 | 5 | 7.855 |
| 1348 | 7.2 | 0.655 | 0.03 | 1.8 | 0.078 | 7.0 | 12.0 | 0.99587 | 3.34 | 0.39 | 9.5 | 5 | 7.855 |
| 65 | 7.2 | 0.725 | 0.05 | 4.65 | 0.086 | 4.0 | 11.0 | 0.9962 | 3.41 | 0.39 | 10.9 | 5 | 7.925 |
| 837 | 6.7 | 0.28 | 0.28 | 2.4 | 0.012 | 36.0 | 100.0 | 0.99064 | 3.26 | 0.39 | 11.7 | 7 | 6.98 |
| 64 | 7.2 | 0.725 | 0.05 | 4.65 | 0.086 | 4.0 | 11.0 | 0.9962 | 3.41 | 0.39 | 10.9 | 5 | 7.925 |
| 836 | 6.7 | 0.28 | 0.28 | 2.4 | 0.012 | 36.0 | 100.0 | 0.99064 | 3.26 | 0.39 | 11.7 | 7 | 6.98 |
| 1237 | 7.1 | 0.75 | 0.01 | 2.2 | 0.059 | 11.0 | 18.0 | 0.99242 | 3.39 | 0.4 | 12.8 | 6 | 7.85 |
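
The output above contains several tied `sulphates` values, so which rows make the cut at position 10 is somewhat arbitrary. One possible way to make ties explicit is `nsmallest` with `keep="all"`, sketched here on a hypothetical column (`toy_df` is made up for illustration).

```python
import pandas as pd

# Hypothetical stand-in data with tied values.
toy_df = pd.DataFrame({"sulphates": [0.39, 0.33, 0.37, 0.39, 0.40, 0.37]})

# Default behaviour: exactly n values, keeping the first occurrence among ties.
lowest2 = toy_df["sulphates"].nsmallest(2)

# keep="all" also returns every value tied with the last one selected,
# so the result can be longer than n.
lowest2_with_ties = toy_df["sulphates"].nsmallest(2, keep="all")

print(lowest2.tolist())             # [0.33, 0.37]
print(lowest2_with_ties.tolist())   # [0.33, 0.37, 0.37]
```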
