# SI370 Day 3: Loading and manipulating data in pandas


# Reminders


## Learning Objectives
* (from last week) explain how boolean masks work in filtering DataFrames
* load CSV files
* load JSON files
* use pd.read_html to extract tables from web pages
* load data from simple APIs 
* handle missing data (dropna and fillna)
* use vectorized string functions

### IMPORTANT: Replace ```?``` in the following code with your uniqname.

In [None]:
MY_UNIQNAME = '?'

## <font color="magenta">Exercise 1 (10 minutes, 1 point):</font>
### a. Sign up for a Kaggle account (https://www.kaggle.com/).  Record your Kaggle username in the following markdown cell


Replace this with your Kaggle username

### b. Browse the Kaggle datasets (https://www.kaggle.com/datasets) and list two or three that you find interesting.  Explain why you find them interesting.

Insert your answer here.

# Today's focus: Loading (and manipulating) data using pandas

In [None]:
import pandas as pd

Recall the ```pd.read_csv``` function that we used to load data sets in previous classes:

In [None]:
menu = pd.read_csv('data/menu.csv') 

That works well for well-formatted CSV files, but what happens when you get something that looks like the ```data/avocado_eu.csv``` file?
Go ahead and browse that file in JupyterLab's CSV browser.

You'll notice a new drop-down menu labelled "Delimiter".  Go ahead and change that to ```;```.

Referring back to your readings and the [read_csv documentation online](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html), complete the following exercise:


Read the data/avocado_eu.csv file into a pandas DataFrame and show the first 5 rows.


In [None]:
avocado = pd.read_csv('data/avocado_eu.csv')
avocado.head(5)

You'll notice that, unless you did something special in the previous read_csv invocation, the decimal points don't look quite right.  Go ahead and find the right option to convert commas to periods when loading a CSV file.  Also figure out how to "fix" the salary numbers so they don't contain commas.  

Hint: Use the documentation at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Or you can search for something like "pandas read_csv decimal thousands comma": https://duckduckgo.com/?q=pandas+read_csv+decimal+thousands+comma&atb=v108-1&ia=web

## <font color="magenta">Exercise 2 (1 point):</font>
Read the data/avocado_eu.csv file using the correct delimiter and decimal character into a dataframe and show the first 5 rows:

In [None]:
# Insert your code here

# Counting the number of values

Sometimes, you'll want to count how often each value occurs.  For example, we might want to know the number of times each 'type'
is reported in our avocado data.  Use the ```value_counts()``` method on a Series to do so:

In [None]:
avocado['type'].value_counts()
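As a small illustration on made-up data (the ```toy_types``` Series below is hypothetical and is not read from the avocado file), this sketch shows the default counts, the ```normalize=True``` option for proportions, and the ```dropna=False``` option for counting missing values:

```python
import pandas as pd

# Hypothetical stand-in for a categorical column such as avocado['type']
toy_types = pd.Series(['organic', 'conventional', 'organic', 'organic', None])

# Counts per distinct value, most frequent first; missing values are excluded by default
counts = toy_types.value_counts()
print(counts)

# Proportions of the non-missing values instead of raw counts
shares = toy_types.value_counts(normalize=True)
print(shares)

# Include missing values in the tally
counts_with_missing = toy_types.value_counts(dropna=False)
print(counts_with_missing)
```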

# Loading JSON data

In addition to CSV files, data is commonly stored in JSON (JavaScript Object Notation) files.  

In [None]:
nfl_football_players = pd.read_json('data/nfl_football_profiles.json')

In [None]:
nfl_football_players.head()
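To see how the shape of the JSON maps onto a DataFrame, here is a minimal sketch that reads a JSON string from memory.  The records and field names are invented for illustration and are not taken from ```data/nfl_football_profiles.json```:

```python
import io
import pandas as pd

# Made-up records; the field names are illustrative only
json_text = '[{"name": "A. Player", "draft_position": 3}, {"name": "B. Player", "draft_position": 12}]'

# A JSON array of objects becomes one row per object and one column per key
toy_players = pd.read_json(io.StringIO(json_text))
print(toy_players)
```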

And, just for fun, show the player with the highest Current Salary from that dataset:

In [None]:
nfl_football_players.sort_values('current_salary', ascending=False).head(1)

# Fixing up the data
Assuming you did something like sort_values on one of the original columns, you probably got the wrong result.

Looking a bit more closely at the results, you'll notice that the current_salary column looks a bit weird: the values are strings that contain commas, so they sort alphabetically rather than numerically.  Remembering that we have made the shift from pythonic to pandorable, we can leverage the impressive-sounding "vectorized string functions" mentioned in the McKinney book.  Specifically, we can use the str.replace(...) method.  Note that had we used read_csv to load the file, we could have used the ```thousands=``` option and avoided all this, but sometimes data doesn't come in a convenient format.
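To illustrate the ```thousands=``` remark, here is a minimal sketch on a made-up CSV string (the names and salaries are invented).  It assumes the salary values are quoted in the file, since they contain the same character that separates the fields:

```python
import io
import pandas as pd

# Made-up CSV text; salaries are quoted because they contain commas
csv_text = 'name,current_salary\nA. Player,"1,250,000"\nB. Player,"900,000"\n'

# Without thousands=, the salary column is read as text
as_text = pd.read_csv(io.StringIO(csv_text))
print(as_text.dtypes)

# With thousands=',', pandas strips the separators and parses the values as numbers
as_numbers = pd.read_csv(io.StringIO(csv_text), thousands=',')
print(as_numbers.dtypes)
```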


One way to apply functions is to operate on a column and then assign the results to another column.  For example, if we wanted to eliminate commas, we could replace them with empty strings:

In [None]:
replaced = nfl_football_players['current_salary'].str.replace(',', '')
replaced.head()

Then assign the result to a new column in the original dataframe (in this case I'm calling the column ```current_salary_nocommas```):

In [None]:
nfl_football_players['current_salary_nocommas'] = replaced

But you'll notice that the type of the column is string, and we want to convert it to a float so we can sort it numerically.  So we can use the astype() function to convert it:

In [None]:
nfl_football_players['current_salary_cleaned'] = nfl_football_players['current_salary_nocommas'].astype(float)
nfl_football_players.head(2)
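Note that ```astype(float)``` raises an error if any entry cannot be parsed as a number.  One possible alternative, sketched here on made-up strings rather than the real salary column, is ```pd.to_numeric``` with ```errors='coerce'```, which turns unparseable entries into ```NaN```:

```python
import pandas as pd

# Made-up salary strings, including one that is not a number
raw = pd.Series(['1,250,000', '900,000', 'unknown'])

# Remove the commas as plain text (regex=False disables pattern matching)
no_commas = raw.str.replace(',', '', regex=False)

# Entries that cannot be parsed become NaN instead of raising ValueError
cleaned = pd.to_numeric(no_commas, errors='coerce')
print(cleaned)
```

Whether silently coercing bad values is appropriate depends on the data; it is worth counting the resulting ```NaN``` values before relying on the column.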

And now we can re-run our command to sort by salary and get the correct result:

In [None]:
nfl_football_players.sort_values('current_salary_cleaned', ascending=False).head(1)

# Dropping missing values

In addition to the "all" or "any" functionality described in McKinney section 7.1, it's sometimes useful to drop a row only if a certain column or columns have missing data.  To do this, use the subset= option with dropna().  So, for example, to drop all players for whom we do not have salary information, we could use the following code:

In [None]:
nfl_football_players_salaries = nfl_football_players.dropna(subset=['current_salary_cleaned'])
nfl_football_players_salaries.head()
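The difference between "any", "all", and ```subset=``` is easier to see on a tiny example.  The DataFrame below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up data: row A lacks a college, row B lacks a salary, row C is complete
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'salary': [1.0e6, np.nan, 5.0e5],
    'college': [None, 'State U', 'Tech'],
})

# how='any' (the default): drop rows with at least one missing value
any_missing_dropped = toy.dropna()

# how='all': drop only rows in which every value is missing (none here)
all_missing_dropped = toy.dropna(how='all')

# subset=: only the 'salary' column decides which rows are dropped
has_salary = toy.dropna(subset=['salary'])

print(any_missing_dropped, all_missing_dropped, has_salary, sep='\n\n')
```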

# Creating dummy variables

We might, on occasion, want to "bin" or "discretize" a variable.  For example, we might want to take the previous dataframe and add dummy variables that map onto whether the salaries are "small" (< \\$1M) , "medium" (\\$1M - \\$10M), or "large" (> \\$10M).  We could do something like the following:

In [None]:
bins = [0,1000000,10000000,1000000000]

In [None]:
dummies = pd.get_dummies(pd.cut(nfl_football_players_salaries['current_salary_cleaned'],bins,labels=['small','medium','large']))

In [None]:
dummies.head()
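It helps to look at what ```pd.cut``` produces before it is passed to ```pd.get_dummies```.  The sketch below uses made-up salaries and names prefixed with ```toy_``` so it does not overwrite the variables above.  Note that the intervals are right-inclusive by default, so a salary of exactly \\$1M falls into "small", and a salary of exactly 0 would fall outside every bin:

```python
import pandas as pd

# Made-up salaries, including one that sits exactly on a bin edge
toy_salaries = pd.Series([500_000, 1_000_000, 2_500_000, 20_000_000])
toy_bins = [0, 1_000_000, 10_000_000, 1_000_000_000]

# Intervals are (0, 1M], (1M, 10M], (10M, 1B] because right=True is the default
toy_sizes = pd.cut(toy_salaries, toy_bins, labels=['small', 'medium', 'large'])
print(toy_sizes.tolist())

# One indicator column per label, one row per salary
toy_dummies = pd.get_dummies(toy_sizes)
print(toy_dummies)
```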

Now that we have a dataframe of dummy variables, we can concatenate it to the original dataframe (concatenating columns to the right rather than rows to the bottom using axis=1 or axis="columns"):

In [None]:
nfl_cats = pd.concat([nfl_football_players_salaries,dummies],axis="columns")
nfl_cats.tail()
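The effect of the ```axis``` argument can be checked on two tiny, made-up DataFrames.  The sketch also shows that column-wise concatenation lines rows up by index label, which is why it still works after ```dropna``` has left gaps in the index:

```python
import pandas as pd

# Two made-up frames that share the same (non-consecutive) index labels
left = pd.DataFrame({'name': ['A', 'B']}, index=[10, 12])
right = pd.DataFrame({'small': [1, 0]}, index=[10, 12])

# axis="columns" aligns on the index and places the frames side by side
side_by_side = pd.concat([left, right], axis="columns")
print(side_by_side)

# axis="index" (the default) stacks rows; columns missing from a frame are filled with NaN
stacked = pd.concat([left, right], axis="index")
print(stacked)
```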

## <font color="magenta">Exercise 3 (2 points):</font>
Create dummy variables for "draft_position", using bins of 1-100 and 101+, and concatenate the dummy variables to the ```nfl_cats``` dataframe.  Make good choices for the column names.

In [None]:
# Insert your code here

# Scraping Tables from HTML

The ```pd.read_html``` function returns a list of DataFrames read from an HTML source.  The following line will return a _list_ of DataFrames from https://en.wikipedia.org/wiki/List_of_largest_sports_contracts

In [None]:
contracts_scraped = pd.read_html('https://en.wikipedia.org/wiki/List_of_largest_sports_contracts',header=0)

In [None]:
len(contracts_scraped)

To get the first table, you'll need to pull off the 0th element:

In [None]:
contracts = contracts_scraped[0]
contracts.head()

## <font color="magenta">Exercise 4 (1 point): </font>

Count the number of players from each sport in the List of Largest Sports Contracts.

Hint:  see value_counts() description above

In [None]:
# Insert your code here

For the final exercise, we're going to return to the nfl_football_players dataframe we created earlier.  

## <font color="magenta">Exercise 5 (5 points): </font>
Create a new dataframe that contains all the columns in the nfl_football_players dataframe as well as an additional column that contains each player's height in centimeters. Show the first 5 rows of your result.

hint: 1 inch = 2.54 cm

hint: you can use the vectorized string function str.split() to separate feet and inches from the original dataframe column, _you might want to figure out what expand=True does in split()_.

hint: remember to cast strings to numeric types if you're going to perform math on them

hint: you might want to create an intermediate (temporary) DataFrame to help you keep things clear instead of attempting to do this in one line 

In [None]:
nfl_football_players.columns

In [None]:
nfl_football_players.height.head(5)

In [None]:
heights = nfl_football_players.height.str.split('-')

In [None]:
heights.head()

In [None]:
heights = nfl_football_players.height.str.split('-', expand=True)

In [None]:
heights.head()

In [None]:
heights.columns = ['feet','inches']

In [None]:
heights.head()

In [None]:
nfl_football_players['cm'] = (heights['feet'].astype(float) * 12 + heights['inches'].astype(float)) * 2.54

In [None]:
nfl_football_players.head(1)

# APIs and requests (FYI only)
You've covered the ```requests``` package in previous courses.  This example shows what you can do with an API that returns JSON:

In [None]:
import requests

In [None]:
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'

In [None]:
resp = requests.get(url)
resp

In [None]:
data = resp.json()

In [None]:
data[0]['title']

In [None]:
issues = pd.DataFrame(data)
issues.head()

In [None]:
issues.columns

In [None]:
issues = pd.DataFrame(data, columns=['number', 'title','labels', 'state'])
issues.head()
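The ```columns=``` argument used above can be tried without a network connection.  The sketch below uses made-up records shaped like a JSON API response (they are not real GitHub data) to show that only the listed keys are kept, and that a listed key absent from the records becomes a column of missing values:

```python
import pandas as pd

# Made-up records shaped like a parsed JSON response (not real GitHub data)
records = [
    {'number': 1, 'title': 'First issue', 'state': 'open', 'extra': 'x'},
    {'number': 2, 'title': 'Second issue', 'state': 'closed'},
]

# columns= keeps only the listed keys, in the listed order
toy_issues = pd.DataFrame(records, columns=['number', 'title', 'state'])
print(toy_issues)

# A listed key that no record contains becomes a column of missing values
with_missing_key = pd.DataFrame(records, columns=['number', 'labels'])
print(with_missing_key)
```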

## <font color="magenta">END OF NOTEBOOK</font>
Please submit this notebook in HTML format via Canvas.