# DATA 202 - Module 3: Cleaning Data
* Instructor: Dr. Josh Fagan
* [Jupyter Notebook Tips and Tricks](http://bit.ly/34embJh)
* [Markdown Cheatsheet](http://bit.ly/2UkNVXV)
* Magic command to list all variables: `%whos`

### Instructions

Welcome to the third assignment of DATA 202. This assignment is meant to help you review and familiarize yourself with cleaning datasets in pandas.

To receive credit for an assignment, answer all questions correctly and submit before the deadline listed on Canvas.

---
### Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others please **include their names** below.

**Collaborators**: *list collaborators here*
* Joseph Beller (vol study center tutoring)
* Y

---
## Exercises


In [1]:
# To answer all of the exercises you will need to import the pandas package. 
# Please do that below, in this code block.
import pandas as pd

### Exercise 0 - Loading and Basic Exploration

In this assignment, we will use the `wine_reviews.csv` dataset found on the Canvas site. Load the data into a `DataFrame` called `reviews`.

Display the first 5 rows of `reviews`.

In [2]:
reviews = pd.read_csv('/Users/carolinelpetersen/Desktop/data science/data/wine_reviews.csv')
reviews.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
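
The `Unnamed: 0` column in the output above usually appears when a CSV was saved together with its row index. The sketch below reproduces that effect on a small in-memory CSV; the text in `csv_text` is made up for illustration and is not taken from `wine_reviews.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV text whose first header cell is empty,
# which is what a saved row index looks like.
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n2,US,87\n"

# Default read: the unnamed first column becomes "Unnamed: 0".
default_read = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats the first column as the row index instead.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_read.columns.tolist())
print(indexed_read.columns.tolist())
```

Whether to pass `index_col=0` for the real file is a judgment call; the assignment keeps the default read.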


## Introduce Error
This is not so much an exercise as an organization subsection that introduces some error into our dataset that we need to fix. All you need to do is run the following cell block without edits. 

In [3]:
import random

random.seed(10)
i = random.randint(0, len(reviews))
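# Assign through the DataFrame itself (not the points Series) to avoid chained assignment.
reviews.at[i, 'points'] = reviews.at[i, 'points'] * 10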
i = random.randint(0, len(reviews))
reviews = pd.concat([reviews, reviews.iloc[i:i+3]])
reviews = reviews.reset_index()
reviews.points = reviews.points.astype(str)

## Exercise 1
Display the number of `NaN` values for each feature. Which column in `reviews` has the most `NaN` values?

In [4]:
pd.isnull(reviews).sum()

index                        0
Unnamed: 0                   0
country                     63
description                  0
designation              37468
points                       0
price                     8996
province                    63
region_1                 21247
region_2                 79461
taster_name              26244
taster_twitter_handle    31214
title                        0
variety                      1
winery                       0
dtype: int64

Your answer here: region_2  
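
As a minimal sketch of the same idea on a toy table, the column with the most missing values can also be picked out programmatically with `idxmax()`. The `toy` DataFrame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with a different number of missing values per column.
toy = pd.DataFrame({
    "country": ["Italy", "US", None, "US"],
    "price": [np.nan, 15.0, 14.0, np.nan],
    "region_2": [None, None, "Willamette Valley", None],
})

# Count missing values per column.
nan_counts = toy.isna().sum()

# Label of the column with the largest count.
worst_column = nan_counts.idxmax()

print(nan_counts)
print(worst_column)
```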

### Exercise 1 Grading Notes

Exercise 1 Grade:

14/14

## Exercise 2
There are a lot of missing `region_2` values, so many that it might not be useful to include this feature in our analysis.

Remove the `region_2` feature from the dataset. 

Display the first 5 rows to make sure changes have been made.

Hint: Explore the pandas `drop()` method.



In [5]:
reviews = reviews.drop(columns=['region_2'])
reviews.head()

Unnamed: 0.1,index,Unnamed: 0,country,description,designation,points,price,province,region_1,taster_name,taster_twitter_handle,title,variety,winery
0,0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
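
A small sketch of how `drop()` behaves, using an invented `toy` DataFrame: the method returns a new DataFrame and leaves the original untouched unless the result is assigned back.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Italy", "US"],
    "region_1": ["Etna", "Willamette Valley"],
    "region_2": [None, "Willamette Valley"],
})

# drop() returns a copy without the listed columns.
trimmed = toy.drop(columns=["region_2"])

print(trimmed.columns.tolist())
print(toy.columns.tolist())  # the original still has region_2
```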


### Exercise 2 Grading Notes

Exercise 2 Grade:

14/14

## Exercise 3
There is only one record with a missing `variety` field, and that is a pretty important field. 

Remove the record with the missing `variety` field from the dataset. Check to see what changes have been made on `reviews` by displaying the number of `NaN` values in the `variety` column after the operation. 

Hint: We can easily use `dropna()` by taking advantage of its `subset` parameter.

In [6]:
reviews = reviews.dropna(subset=['variety'])
pd.isnull(reviews['variety']).sum()

0
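
To illustrate why `subset` matters, the sketch below uses an invented `toy` DataFrame with gaps in two columns. Only rows missing a `variety` are removed; the missing `price` is left alone.

```python
import pandas as pd

toy = pd.DataFrame({
    "variety": ["Riesling", None, "Pinot Noir"],
    "price": [13.0, 20.0, None],
})

# Only rows with NaN in the listed columns are dropped.
kept = toy.dropna(subset=["variety"])

print(kept)
print(kept["price"].isna().sum())  # the missing price survives
```

Calling `dropna()` without `subset` would also drop the row with the missing `price`.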

### Exercise 3 Grading Notes

Exercise 3 Grade:

14/14

## Exercise 4
Replace all `NaN` values in the `price` column with the average price from the full `DataFrame`. Check to see what changes have been made on `reviews` by displaying the number of `NaN` values in the price column after the operation. 

In [7]:
x = reviews['price'].mean()
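# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas.
reviews['price'] = reviews['price'].fillna(x)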
pd.isnull(reviews['price']).sum()

0
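
A minimal sketch of mean imputation on invented numbers. Note that `mean()` skips `NaN` by default, so the average is computed from the known prices only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"price": [10.0, np.nan, 30.0, np.nan]})

# mean() ignores NaN, so this averages 10.0 and 30.0.
mean_price = toy["price"].mean()

# Assign the filled column back to the DataFrame.
toy["price"] = toy["price"].fillna(mean_price)

print(mean_price)
print(toy["price"].tolist())
```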

### Exercise 4 Grading Notes

Exercise 4 Grade:

14/14

## Exercise 5
Use the `.describe()` function on the `price` column to see basic statistics on all the values in the column. 

In [8]:
reviews['price'].describe()

count    129973.000000
mean         35.362986
std          39.576876
min           4.000000
25%          18.000000
50%          28.000000
75%          40.000000
max        3300.000000
Name: price, dtype: float64

Now do the same thing with the `points` column. 

In [9]:
reviews['points'].describe()

count     129973
unique        22
top           88
freq       17206
Name: points, dtype: object

Why do we not get all of the statistics with the `points` column? There is no mean, min, max, or any of the quartiles. Let's look and find out why. 

Execute the function to check the datatype of each column.

In [10]:
reviews.dtypes

index                      int64
Unnamed: 0                 int64
country                   object
description               object
designation               object
points                    object
price                    float64
province                  object
region_1                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dtype: object

From this we can see that the `price` column is `float64`, but the `points` column is `object`, which here means it holds strings. We know `points` values are integers, so we need to fix the data type. Doing so will let us perform mathematical calculations on the data, such as finding the maximum points for a region, the average points per price, and the overall median points value. 

In the cell below, cast the `points` column as an `int`.

In [11]:
reviews.points = reviews.points.astype(int)

In [12]:
reviews.dtypes

index                      int64
Unnamed: 0                 int64
country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dtype: object

Now run the `describe()` command again to see the statistics specific to the `points` feature. 

In [13]:
reviews.points.describe()

count    129973.000000
mean         88.453640
std           3.866619
min          80.000000
25%          86.000000
50%          88.000000
75%          91.000000
max         950.000000
Name: points, dtype: float64
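
The effect of the data type on `describe()` can be sketched with a few invented values. The second half shows `pd.to_numeric(..., errors="coerce")`, which is an alternative not used in this assignment: it turns unparseable strings into `NaN` instead of raising an error as `astype(int)` would.

```python
import pandas as pd

# Numbers stored as strings (object dtype), as in the exercise.
as_text = pd.Series(["87", "88", "90"], dtype=object, name="points")
text_stats = as_text.describe()        # count, unique, top, freq

# After casting, describe() reports numeric statistics.
as_int = as_text.astype(int)
numeric_stats = as_int.describe()      # count, mean, std, min, ...

print(text_stats.index.tolist())
print(numeric_stats.index.tolist())

# Hypothetical messy column: "n/a" cannot be parsed as a number.
messy = pd.Series(["87", "n/a", "90"], dtype=object)
coerced = pd.to_numeric(messy, errors="coerce")
print(coerced.isna().sum())
```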

### Exercise 5 Grading Notes

Exercise 5 Grade:

14/14

## Exercise 6
### Preamble
In this exercise we want to correct any erroneous `points` values. As stated previously, we can use metadata to check what values a feature can and should take on. We have looked at the `points` feature before, but as a recap, its values range from 80 to 100, based on the supplied metadata and documentation. 

Run the cell below to see if we have any erroneous `points` values. 

In [14]:
reviews.points.value_counts().sort_index()

80       397
81       692
82      1836
83      3025
84      6480
85      9533
86     12600
87     16933
88     17206
89     12226
90     15410
91     11359
92      9613
93      6489
94      3758
95      1534
96       523
97       229
98        77
99        33
100       19
950        1
Name: points, dtype: int64

From this output we can see there is one value well above 100. We have two options for fixing it.

1) Remove the whole record. 
2) Try to guess what the point value should be.

We can see the value ends with a "0", so we could guess that a decimal point was misplaced and fix it by removing the extra order of magnitude.

Let's go with our second option and correct the value by effectively dividing by 10. 
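
The 80 to 100 range from the preamble can also be checked directly instead of reading through value counts. The sketch below assumes that range and uses an invented `toy` DataFrame; `between()` is inclusive on both ends by default.

```python
import pandas as pd

toy = pd.DataFrame({"points": [87, 950, 100, 80]})

# True where the value lies in the documented range [80, 100].
in_range = toy["points"].between(80, 100)

# Rows that violate the range.
out_of_range = toy[~in_range]

print(out_of_range)
```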

### Problem Statement
1. Find the location of the erroneous point value.
2. Change the value to be its current value, divided by 10. 

Hints: 
1. The erroneous point value is the maximum value in the `points` column. You can use this to help you find the right location, or just use the hard-coded value in a comparison. 
2. Use `loc[]` to avoid any warnings.

In [15]:
reviews.loc[reviews.points == 950].index[0]

74894

In [16]:
reviews.loc[74894, 'points'] = 950/10

Run the cell below to check and make sure you have solved the issue. 

In [17]:
reviews.points.value_counts().sort_index()

80       397
81       692
82      1836
83      3025
84      6480
85      9533
86     12600
87     16933
88     17206
89     12226
90     15410
91     11359
92      9613
93      6489
94      3758
95      1535
96       523
97       229
98        77
99        33
100       19
Name: points, dtype: int64
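
One way to make the same correction without hard-coding the row label or the value is sketched below on an invented `toy` DataFrame. It relies on the hint that the erroneous value is the column maximum, and uses integer division so the column stays an integer type.

```python
import pandas as pd

toy = pd.DataFrame({"points": [87, 950, 100, 80]})

# Row label of the largest value in the column.
bad_index = toy["points"].idxmax()

# Read the current value and write back one tenth of it.
toy.loc[bad_index, "points"] = toy.loc[bad_index, "points"] // 10

print(bad_index)
print(toy["points"].tolist())
```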

### Exercise 6 Grading Notes

Deductions:
- Hardcoded the values (-2)

Exercise 6 Grade:

13/15

## Exercise 7

Show what duplicate lines exist in the dataset.

In [20]:
print(reviews.duplicated())

0         False
1         False
2         False
3         False
4         False
          ...  
129969    False
129970    False
129971     True
129972     True
129973     True
Length: 129973, dtype: bool
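
By default `duplicated()` flags only the later copies of a repeated row. As a sketch on invented data, passing `keep=False` flags every copy, which makes it possible to display the duplicated rows themselves rather than a column of booleans.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "B"],
    "points": [87, 90, 88, 90],
})

# Default keep="first": only the repeat (row 3) is flagged.
first_flags = toy.duplicated()

# keep=False flags both the original and its repeat.
all_copies = toy[toy.duplicated(keep=False)]

print(first_flags.tolist())
print(all_copies)
```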


Remove the duplicate lines.

In [21]:
reviews.drop_duplicates(inplace = True)

Perform another check to see if you fixed the problem.

In [22]:
print(reviews.duplicated().sum())

0
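
A short sketch of the removal step on invented data. Here the result is assigned back rather than modified in place, and `reset_index(drop=True)` renumbers the rows without adding an extra `index` column; both choices are stylistic assumptions, not requirements of the exercise.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "B"],
    "points": [87, 90, 88, 90],
})

# Keep the first copy of each row, then renumber the rows 0..n-1.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(len(toy), len(deduped))
print(deduped.duplicated().sum())
```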


### Exercise 7 Grading Notes

Exercise 7 Grade:

15/15

---
## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. I recommend going to the "Kernel" menu at the top and selecting "Restart & Run All". This will ensure that everything runs correctly when it is run sequentially. 

### Final Grade
98/100