# Objectives
1. Use Pandas methods to work with missing data
2. Convert columns from one data type to another

We will again use the PGA Stats CSV file that we used in Week 1 and earlier in the course. Run the following code cell to import the libraries that we will use in this lab. Note that we have added the statement `%matplotlib inline`, an IPython magic command that tells Jupyter Notebook to render Matplotlib plots inline, directly below the code cell that produced them, rather than in a separate window. You can read more [here](https://nbviewer.jupyter.org/github/ipython/ipython/blob/1.x/examples/notebooks/Part%203%20-%20Plotting%20with%20Matplotlib.ipynb).

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load the Data
We are again going to use the PGA Stats dataset.

**Q1.1.** Read the CSV file to a DataFrame named `pga` and specify the index as the player's name. Remember, the file must be in the same directory as this Jupyter Notebook or you must specify the entire file path. Inspect the first five rows.
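
As a rough illustration of the `index_col` parameter, the sketch below reads a tiny in-memory CSV instead of the lab file. The column name `PLAYER` and all of the rows are made up for the example; check the actual header of the PGA Stats file before reusing the name.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk (names and values are invented).
csv_text = """PLAYER,EVENTS,RK
Alice Example,20,1
Bob Sample,18,2
Cara Placeholder,5,3
"""

# index_col names the column whose values become the row labels.
toy = pd.read_csv(io.StringIO(csv_text), index_col="PLAYER")

# head() returns up to the first five rows (this toy frame only has three).
print(toy.head())
```

With a real file, the first argument would be the file name or full path rather than a `StringIO` object.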

In [1]:
# read file and set index col


# inspect



**Q1.2.** Next, summarize the data to understand the types of data and some summary statistics for numeric columns.

In [None]:
# info and describe



Do any of the summary statistics for `RK` make sense? Why are `AGE`, `CUP POINTS`, and `EARNINGS` not represented in the summary statistics?

**Q1.3.** It may not be obvious from the output of `head`, but `CUP POINTS` and `EARNINGS` contain commas and dollar signs, which prevent Pandas from inferring a numeric data type. As a result, it stores them as generic objects. It is less obvious why `AGE` is not numeric. To understand why, inspect the last five rows of the data set using the `tail` method.
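
The effect can be reproduced on a small, made-up table. In the sketch below (invented names and values, with a hypothetical `PLAYER` column), a single `--` in `AGE` and the symbols in `EARNINGS` are enough to keep both columns from being read as numbers, so `describe` reports only `EVENTS`.

```python
import io

import pandas as pd

# Invented data: one '--' placeholder and currency formatting, as in the lab file.
csv_text = """PLAYER,EVENTS,AGE,EARNINGS
Alice Example,20,31,"$1,250,000"
Bob Sample,18,28,"$640,500"
Cara Placeholder,1,--,--
"""

mixed = pd.read_csv(io.StringIO(csv_text), index_col="PLAYER")

# AGE and EARNINGS are stored as text because not every value parses as a number.
print(mixed.dtypes)

# tail() shows the rows at the bottom, where the '--' placeholder appears.
print(mixed.tail())

# describe() summarizes numeric columns only, so the text columns are left out.
summary = mixed.describe()
print(summary)
```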

In [None]:
# inspect last 5


In this dataset, '--' represents an unknown quantity, a zero, or a not-applicable designation. Ideally, we would speak with the owner of the data to clarify how each of these is encoded. In reality, this is often not possible, and the same value may be used to encode all three cases. Often, we must infer how to handle these values from the data itself. For this exercise, we will encode those values as NaN (not a number) values.

**Q1.4.** In the cell below, read the 'PGA Stats.csv' file again to a DataFrame named `pga`, but specify '--' as the argument to the `na_values` parameter. This parameter converts each '--' occurrence to NaN. The `read_csv` function has many other parameters for handling a variety of issues with data files. You can find the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).

After creating the DataFrame, use the `info` and `describe` methods to note the differences. What computational data type is the `AGE` column and how many non-null elements are there? What are the summary statistics?
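
A minimal sketch of what `na_values` changes, again on invented data with a hypothetical `PLAYER` column: once `--` is read as missing, `AGE` can be parsed as a number, while `EARNINGS` is still text because of the dollar signs and commas.

```python
import io

import pandas as pd

# Invented data with '--' standing in for missing entries.
csv_text = """PLAYER,AGE,EARNINGS
Alice Example,31,"$1,250,000"
Bob Sample,28,"$640,500"
Cara Placeholder,--,--
"""

# Treat every '--' cell as missing while parsing.
parsed = pd.read_csv(io.StringIO(csv_text), index_col="PLAYER", na_values="--")

# info() lists the non-null count and data type of each column.
parsed.info()

# AGE is now numeric, so it shows up in the summary statistics.
print(parsed.describe())
```

`AGE` comes back as a float column rather than an integer one because NaN is itself a floating-point value.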

In [3]:
# reread file, specify na_values for '--'




Run the cell below. It removes the dollar signs and commas and then converts the column to a float so that we can analyze it. **Note: the `replace` method used below is a string method, which is why we access the `str` attribute first. This is different from the `replace` method from the reading.**

The statement that is commented out is equivalent to the first two statements. It uses a regular expression (regex) to replace dollar signs and commas with empty strings and then converts the result to a float. Regular expressions are outside the scope of this class.

In [4]:
# regex=False treats '$' as a literal character on every pandas version
pga['EARNINGS'] = pga['EARNINGS'].str.replace('$', '', regex=False)
pga['EARNINGS'] = pga['EARNINGS'].str.replace(',', '', regex=False).astype(float)
# pga['EARNINGS'] = pga['EARNINGS'].replace(r'[\$,]', '', regex=True).astype(float)
pga.head()
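
The same clean-up can be tried on a small stand-alone Series. This is only a sketch on invented values. It passes `regex=False` explicitly because `$` means "end of string" in a regular expression, and older pandas releases treated the pattern as a regex by default, so the literal intent is worth spelling out.

```python
import numpy as np
import pandas as pd

# Invented earnings strings, including one missing value.
earnings = pd.Series(["$1,250,000", "$640,500", np.nan], name="EARNINGS")

# Two literal replacements, then a conversion to float.
no_dollar = earnings.str.replace("$", "", regex=False)
cleaned = no_dollar.str.replace(",", "", regex=False).astype(float)

# One-pass alternative: the character class [$,] matches either symbol.
one_pass = earnings.str.replace(r"[$,]", "", regex=True).astype(float)

print(cleaned)
```

Missing values pass through the string methods untouched and remain NaN after the conversion.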

**Q1.5.** Complete the same process and remove the commas for the `CUP POINTS` column. Inspect the first five rows.

In [5]:
# remove commas from cup points


# 2. Missing Values
Now that we have translated '--' to NaN values and removed dollar signs and commas, all of our data should be numeric. However, there are still several missing values in the AGE and EARNINGS columns.

**Q2.1.** In the below cell, create an ordered list using Markdown to list at least two strategies to deal with the missing data.

**Q2.2.** We will assume that a missing value in the EARNINGS column means that they did not earn any money. Fill these missing values with zeros. Apply the `info` method to verify the null values were filled.
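
As an illustration of the fill step, the sketch below applies `fillna` to a small invented DataFrame (the player names and amounts are placeholders, not values from the lab file).

```python
import numpy as np
import pandas as pd

# Invented data: two players have no recorded earnings.
money = pd.DataFrame(
    {"EVENTS": [20, 3, 1], "EARNINGS": [1250000.0, np.nan, np.nan]},
    index=["Alice Example", "Bob Sample", "Cara Placeholder"],
)

missing_before = money["EARNINGS"].isna().sum()

# fillna returns a new Series, so assign it back to the column.
money["EARNINGS"] = money["EARNINGS"].fillna(0)

# info() should now report the same non-null count for every column.
money.info()
```

Filling with zero is only appropriate under the stated assumption that a missing value means no money was earned.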

In [6]:
# fill null values with 0

# verify


**Q2.3.** For the missing values in the AGE column, it is less clear what to do. We know that there are 156 players with an unknown age, and we know that the DataFrame is sorted from the highest-ranked to the lowest-ranked player. In the cell below, inspect the first 10 players with an unknown age.
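
One way to pull out rows with a missing value is a boolean mask built with `isna`. The sketch below uses invented players and values purely to show the pattern.

```python
import numpy as np
import pandas as pd

# Invented data, already ordered from best to worst rank.
ranked = pd.DataFrame(
    {"AGE": [31, np.nan, 28, np.nan], "EARNINGS": [1250000.0, 790000.0, 640500.0, 0.0]},
    index=["Alice Example", "Bob Sample", "Cara Placeholder", "Dan Demo"],
)

# isna() gives True for each missing AGE; indexing with it keeps only those rows.
unknown_age = ranked[ranked["AGE"].isna()]

# Row order is preserved, so head(10) returns the highest-ranked matches first.
print(unknown_age.head(10))
```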

In [7]:
# check out first 10 players with unknown age


It turns out that there are some well known and successful players with an unknown age. Poor Benjamin Silverman won nearly $800K and ESPN can't track down his age. We will leave players with an unknown age as an NaN value.

# 3. Plotting
**Q3.1.** Create a histogram of the number of events to understand the distribution. First, use the default number of bins to get an idea of the frequency of events played. Then, use the `bins` parameter to customize the size of the bins in order to get more granularity. How many values can EVENTS take on? What argument value makes sense in this case?
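
The sketch below compares the default binning with one possible custom choice on invented event counts. Placing the bin edges halfway between integers, so that each whole number gets its own bar, is an assumption about what suits count data, not the only reasonable answer.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented event counts for ten players.
events = pd.Series([1, 1, 1, 2, 3, 3, 5, 8, 8, 10], name="EVENTS")

fig, (ax_default, ax_custom) = plt.subplots(1, 2, figsize=(8, 3))

# Default: the range of the data is split into 10 equal-width bins.
ax_default.hist(events)
ax_default.set_title("Default bins")

# Custom: edges at 0.5, 1.5, ..., 10.5 give one bin per integer value.
edges = np.arange(events.min(), events.max() + 2) - 0.5
counts, _, _ = ax_custom.hist(events, bins=edges)
ax_custom.set_title("One bin per value")
```

`hist` returns the bar heights as its first value, which is a convenient way to check the counts without reading them off the plot.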

In [8]:
# plot histogram



There are many players in the dataset, like Tony Romo, who played in only a single tournament on a sponsor's exemption. As the earlier `info` output showed, only 403 of 559 players have a recorded age and only 253 players earned money.

**Q3.2.** Next, subset the data to those players who played in at least three events. Name the resulting DataFrame `pga_pros`. Use the `describe` method to view the summary statistics of each numeric column. 
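
Subsetting by a condition follows the same boolean-mask pattern used for missing values. A minimal sketch on invented data:

```python
import pandas as pd

# Invented data: two regular players and two occasional entrants.
field = pd.DataFrame(
    {"EVENTS": [22, 3, 1, 2], "EARNINGS": [1250000.0, 640500.0, 0.0, 15000.0]},
    index=["Alice Example", "Bob Sample", "Cara Placeholder", "Dan Demo"],
)

# The comparison yields one True/False per row; only True rows are kept.
regulars = field[field["EVENTS"] >= 3]

# The minimum of EVENTS in the summary confirms the filter worked.
print(regulars.describe())
```

For whole-number counts, `>= 3` and `> 2` select the same rows.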

In [9]:
# subset events > 2

# describe() to verify


**Q3.3.** In the cell below, create a scatter plot with Events on the x-axis and Earnings on the y-axis. Label the horizontal axis 'Events Played' and the vertical axis 'Earnings'. Notice how the provided code sets the y-axis ticks with a minimum of 0, a maximum of 10 million, and an interval of 2 million, and customizes the tick labels so they appear as \\$0M, \\$2M, etc.

In [10]:
# scatter plot events vs earnings


# add x and y labels


# changing y-axis ticks
plt.yticks([0, 2e6, 4e6, 6e6, 8e6, 10e6], ['$0M', '$2M', '$4M', '$6M', '$8M', '$10M'])
plt.show()
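
The tick positions and labels above are typed out by hand. As a sketch of an alternative, they can be generated from the minimum, maximum, and interval. The data points here are invented, and the example uses Matplotlib's Axes methods rather than the `plt` functions in the cell above.

```python
import matplotlib.pyplot as plt
import numpy as np

# Invented points: events played and earnings in dollars.
events_played = [5, 12, 20, 25]
earnings = [150_000, 900_000, 4_200_000, 8_700_000]

fig, ax = plt.subplots()
ax.scatter(events_played, earnings)
ax.set_xlabel("Events Played")
ax.set_ylabel("Earnings")

# Ticks from 0 to 10 million in steps of 2 million (the stop value is exclusive).
tick_values = np.arange(0, 10_000_001, 2_000_000)

# Turn each tick into a label such as '$4M'.
tick_labels = [f"${value // 1_000_000}M" for value in tick_values]

ax.set_yticks(tick_values)
ax.set_yticklabels(tick_labels)
```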