# SI370 Day 3: Loading and manipulating data in pandas
## Please download Day3.zip from Canvas -> Files and start working on the first Exercise


# Reminders


## Learning Objectives
* load weird CSV files
* load JSON files
* use pd.read_html to extract tables from web pages
* load data from simple APIs 
* load data from a SQL database
* handle missing data (dropna and fillna)
* use vectorized string functions
* Pandas refresher (or introduction)
* explain how pandas operations differ from "traditional" python
* be able to load a CSV file into a Pandas DataFrame
* explain how to extract columns from a DataFrame
* sort a DataFrame
* assign a column as the index of a DataFrame
* filter a DataFrame according to some criteria
* explain how boolean masks work in filtering DataFrames

### IMPORTANT: Replace ```?``` in the following code with your uniqname.

In [18]:
MY_UNIQNAME = '?'

## <font color="magenta">Exercise 1 (2 points):</font>
### a. Sign up for a Kaggle account (https://www.kaggle.com/).  Record your Kaggle username in the following markdown cell


Replace this with your Kaggle username

### b. Browse the Kaggle datasets (https://www.kaggle.com/datasets) and list two or three that you find interesting.  Explain why you find them interesting.

Insert your answer here.

# Today's focus: Loading (and manipulating) data using pandas

In [1]:
import pandas as pd

Recall the ```pd.read_csv``` function that we used to load data sets in previous classes:

In [35]:
menu = pd.read_csv('data/menu.csv') 

That works great for well-formatted CSV files, but what happens when you get something that looks like the ```data/avocado_eu.csv``` file?
Go ahead and browse that in JupyterLab's CSV browser.

You'll notice a new drop-down menu labelled "Delimiter".  Go ahead and change that to ```;```.

Referring back to your readings and the [read_csv documentation online](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html), complete the following exercise:


Read the data/avocado_eu.csv file into a pandas DataFrame and show the first 5 rows.


In [4]:
avocado = pd.read_csv('data/avocado_eu.csv', delimiter=';')
avocado.head(5)

Unnamed: 0.1,Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,0,2015-12-27,133,6423662,103674,5445485,4816,869687,860362,9325,0,conventional,2015,Albany
1,1,2015-12-20,135,5487698,67428,4463881,5833,950556,940807,9749,0,conventional,2015,Albany
2,2,2015-12-13,93,11822022,7947,10914967,1305,814535,804221,10314,0,conventional,2015,Albany
3,3,2015-12-06,108,7899215,11320,7197641,7258,581116,56774,13376,0,conventional,2015,Albany
4,4,2015-11-29,128,510396,94148,4383839,7578,618395,598626,19769,0,conventional,2015,Albany


You'll notice that, unless you did something special in the previous read_csv invocation, the decimal points don't look quite right.  Go ahead and find the right option to convert commas to periods when loading a CSV file.

## Exercise 2 (1 point):
Read the data/avocado_eu.csv file using the correct delimiter and decimal character into a dataframe and show the first 5 rows:

In [5]:
# insert your code here

# Counting the number of values

Sometimes you'll want to count how many times each value occurs.  For example, we might want to know how many times each 'type' is reported in our avocado data.  Use the ```value_counts()``` method on a Series to do so:

In [40]:
avocado['type'].value_counts()

conventional    9126
organic         9123
Name: type, dtype: int64
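
As a small sketch of what else ```value_counts()``` can do, the following uses a toy Series (the values are made up and only stand in for the 'type' column).  It shows the default counts, the ```normalize=True``` option for proportions, and the ```dropna=False``` option for counting missing values too:

```python
import pandas as pd

# Toy data standing in for the avocado 'type' column (values are made up)
types = pd.Series(['conventional', 'organic', 'conventional', None, 'organic', 'conventional'])

counts = types.value_counts()                     # counts per value, most frequent first; missing values excluded
shares = types.value_counts(normalize=True)       # proportions of the non-missing values
with_missing = types.value_counts(dropna=False)   # also count the missing value

print(counts)
print(shares)
print(with_missing)
```

By default missing values are left out, so the counts may not add up to the length of the Series.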

# Loading JSON data

In addition to CSV files, JSON (JavaScript Object Notation) files are commonly used to store and exchange data.

In [9]:
nfl_football_players = pd.read_json('data/nfl_football_profiles.json')

In [11]:
nfl_football_players.head()

Unnamed: 0,birth_date,birth_place,college,current_salary,current_team,death_date,draft_position,draft_round,draft_team,draft_year,height,high_school,hof_induction_year,name,player_id,position,weight
0,1967-05-12,"Bay City, TX",Baylor,,,,34.0,2.0,Seattle Seahawks,1990.0,6-0,"Van Vleck, TX",,Robert Blackmon,1809,DB,208.0
1,1970-07-20,"Louisville, KY",Kentucky,,,,85.0,4.0,Seattle Seahawks,1993.0,6-3,"Holy Cross, KY",,Dean Wells,23586,LB,248.0
2,1990-08-14,"Newton, MA",Oregon,1075000.0,Miami Dolphins,,46.0,2.0,Buffalo Bills,2013.0,6-3,"Los Gatos, CA",,Kiko Alonso,355,ILB,238.0
3,1948-04-22,"Dallas, TX",North Texas,,,1999-10-15,126.0,5.0,New Orleans Saints,1970.0,6-2,"W.W. Samuell, TX",,Steve Ramsey,18182,QB,210.0
4,1988-02-27,"Neptune, NJ",Miami (FL),,,,,,,,6-0,"Neptune, NJ",,Cory Nelms,16250,CB,195.0
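
To see what ```pd.read_json``` expects, here is a minimal sketch that reads a small, hypothetical JSON document from a string instead of a file.  The records and their values are invented for illustration; each JSON object becomes one row and each key becomes a column:

```python
import io
import pandas as pd

# Hypothetical records: a JSON list of objects, one object per row
raw = '''[
  {"name": "Player A", "position": "QB", "weight": 210},
  {"name": "Player B", "position": "DB", "weight": null}
]'''

# Wrap the literal JSON text in StringIO; a file path works the same way
players = pd.read_json(io.StringIO(raw))

print(players)
print(players.dtypes)
```

Note that the JSON ```null``` becomes a missing value (NaN), which is why the weight column ends up as a float.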


And, just for fun, show the player with the highest Current Salary from that dataset:

In [49]:
# Insert your code here

# Fixing up the data
Assuming you did something like sort_values on one of the original columns, you probably got the wrong result.

Looking a bit more closely at the results, you'll notice that the current_salary column holds strings with commas as thousands separators, so it sorts alphabetically rather than numerically.  Remembering that we have made the shift from pythonic to pandorable, we can leverage the impressive-sounding "vectorized string functions" mentioned in Section XXX of the McKinney book.  Specifically, we can use the str.replace(...) method.  Note that had we used read_csv to load the file, we could have used the ```thousands=``` option and avoided all this, but sometimes data doesn't come in a convenient format.
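
The ```thousands=``` option of read_csv mentioned above can be sketched as follows.  The CSV text and player names are hypothetical and are read from a string so the example stands alone; our actual data is JSON, so this only illustrates the option:

```python
import io
import pandas as pd

# Hypothetical CSV text; the quoted salaries use commas as thousands separators
csv_text = 'name,current_salary\nPlayer A,"1,075,000"\nPlayer B,"23,943,600"\nPlayer C,\n'

as_text = pd.read_csv(io.StringIO(csv_text))                     # salaries stay strings
as_numbers = pd.read_csv(io.StringIO(csv_text), thousands=',')   # salaries parsed as numbers

print(as_text['current_salary'].dtype)
print(as_numbers['current_salary'].dtype)

# With a numeric column, sorting by salary behaves as expected
top = as_numbers.sort_values('current_salary', ascending=False).head(1)
print(top)
```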

One way to apply functions is to operate on a column and then assign the results to another column.  For example, if we wanted to eliminate commas, we could replace them with empty strings:


In [None]:
nfl_football_players['current_salary'].str.replace(',', '')

And assign the results to a column in the original dataframe (in this case I'm calling the column current_salary_nocommas)

In [52]:
nfl_football_players['current_salary_nocommas'] = nfl_football_players['current_salary'].str.replace(',', '')

But you'll notice that the type of the column is string, and we want to convert it to a float so we can sort it numerically.  So we can use the astype() function to convert it:

In [53]:
nfl_football_players['current_salary_cleaned'] = nfl_football_players['current_salary_nocommas'].astype(float)

And now we can re-run our command to sort by salary and get the correct result:

In [55]:
nfl_football_players.sort_values('current_salary_cleaned', ascending=False).head(1)

Unnamed: 0,birth_date,birth_place,college,current_salary,current_team,death_date,draft_position,draft_round,draft_team,draft_year,height,high_school,hof_induction_year,name,player_id,position,weight,current_salary_cleaned,current_salary_nocommas
17756,1988-08-19,"Holland, MI",Michigan St.,23943600,Washington Redskins,,102.0,4.0,Washington Redskins,2012.0,6-3,"Holland Christian, MI",,Kirk Cousins,4644,QB,214.0,23943600.0,23943600


## Exercise X: Create a new column in the nfl_football_players DataFrame that contains each player's height in centimeters.
hint: 1 inch = 2.54 cm

hint: you can use the vectorized string function str.split() to separate feet and inches from the original dataframe column

hint: remember to cast strings to numeric types if you're going to perform math on them

hint: you might want to create an intermediate (temporary) DataFrame to help you keep things clear instead of attempting to do this in one line

In [84]:
heights = nfl_football_players['height'].str.split('-',expand=True)

In [91]:
heights['cm'] = ( heights[0].astype(float) * 12 + heights[1].astype(float) ) * 2.54
                    

In [93]:
pd.concat([nfl_football_players,heights['cm']],axis=1)

Unnamed: 0,birth_date,birth_place,college,current_salary,current_team,death_date,draft_position,draft_round,draft_team,draft_year,...,hof_induction_year,name,player_id,position,weight,current_salary_cleaned,current_salary_nocommas,inches,feet,cm
0,1967-05-12,"Bay City, TX",Baylor,,,,34.0,2.0,Seattle Seahawks,1990.0,...,,Robert Blackmon,1809,DB,208.0,,,1,0,182.88
1,1970-07-20,"Louisville, KY",Kentucky,,,,85.0,4.0,Seattle Seahawks,1993.0,...,,Dean Wells,23586,LB,248.0,,,1,0,190.50
2,1990-08-14,"Newton, MA",Oregon,1075000,Miami Dolphins,,46.0,2.0,Buffalo Bills,2013.0,...,,Kiko Alonso,355,ILB,238.0,1075000.0,1075000,1,0,190.50
3,1948-04-22,"Dallas, TX",North Texas,,,1999-10-15,126.0,5.0,New Orleans Saints,1970.0,...,,Steve Ramsey,18182,QB,210.0,,,1,0,187.96
4,1988-02-27,"Neptune, NJ",Miami (FL),,,,,,,,...,,Cory Nelms,16250,CB,195.0,,,1,0,182.88
5,1982-08-18,"Columbus, GA",Notre Dame,,,,144.0,5.0,St. Louis Rams,2005.0,...,,Jerome Collins,4310,TE,267.0,,,1,0,193.04
6,1992-10-27,"Cincinnati, OH",Louisville,1762000,Buffalo Bills,,73.0,3.0,Buffalo Bills,2014.0,...,,Preston Brown,2701,ILB,251.0,1762000.0,1762000,1,0,185.42
7,1945-03-17,"Steubenville, OH",Wyoming,,,,,,,,...,,Hub Lindsey,13379,RB,196.0,,,1,0,180.34
8,1978-11-11,"Suitland, MD",Maryland,,,,49.0,2.0,New York Jets,2001.0,...,,LaMont Jordan,11755,RB,230.0,,,1,0,177.80
9,1921-10-06,"Brooklyn, NY",Dartmouth,,,,178.0,18.0,Pittsburgh Steelers,1945.0,...,,Alex Wizbicki,24550,DB-HB,188.0,,,1,0,180.34


In [94]:
nfl_football_players.head()

Unnamed: 0,birth_date,birth_place,college,current_salary,current_team,death_date,draft_position,draft_round,draft_team,draft_year,...,high_school,hof_induction_year,name,player_id,position,weight,current_salary_cleaned,current_salary_nocommas,inches,feet
0,1967-05-12,"Bay City, TX",Baylor,,,,34.0,2.0,Seattle Seahawks,1990.0,...,"Van Vleck, TX",,Robert Blackmon,1809,DB,208.0,,,1,0
1,1970-07-20,"Louisville, KY",Kentucky,,,,85.0,4.0,Seattle Seahawks,1993.0,...,"Holy Cross, KY",,Dean Wells,23586,LB,248.0,,,1,0
2,1990-08-14,"Newton, MA",Oregon,1075000.0,Miami Dolphins,,46.0,2.0,Buffalo Bills,2013.0,...,"Los Gatos, CA",,Kiko Alonso,355,ILB,238.0,1075000.0,1075000.0,1,0
3,1948-04-22,"Dallas, TX",North Texas,,,1999-10-15,126.0,5.0,New Orleans Saints,1970.0,...,"W.W. Samuell, TX",,Steve Ramsey,18182,QB,210.0,,,1,0
4,1988-02-27,"Neptune, NJ",Miami (FL),,,,,,,,...,"Neptune, NJ",,Cory Nelms,16250,CB,195.0,,,1,0


In [95]:
# dropna takes the columns to check via subset=; its first parameter is axis, not a column name
hof = nfl_football_players.dropna(subset=['hof_induction_year'])
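
Here is a minimal sketch of ```dropna``` and ```fillna``` on a toy DataFrame (all names and values are made up).  The columns to check are named through ```subset=```, since the first parameter of ```dropna``` is ```axis``` rather than a column name:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN marks missing data
df = pd.DataFrame({
    'name': ['Player A', 'Player B', 'Player C'],
    'hof_induction_year': [1999.0, 2005.0, np.nan],
    'weight': [210.0, np.nan, 195.0],
})

complete_rows = df.dropna()                            # drop rows with a NaN in any column
hof_only = df.dropna(subset=['hof_induction_year'])    # only look at one column
filled = df.fillna({'weight': df['weight'].mean()})    # fill one column with its mean

print(complete_rows)
print(hof_only)
print(filled)
```

Both methods return a new DataFrame and leave ```df``` unchanged, so assign the result to a name if you want to keep it.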

# Scraping Tables from HTML

The ```pd.read_html``` function returns a list of DataFrames read from an HTML source.  The following line will return a list of DataFrames from https://en.wikipedia.org/wiki/List_of_largest_sports_contracts

In [30]:
contracts_scraped = pd.read_html('https://en.wikipedia.org/wiki/List_of_largest_sports_contracts',header=0)

In [31]:
len(contracts_scraped)

1

To get the first table, you'll need to pull off the 0th element:

In [27]:
contracts = contracts_scraped[0]

## Exercise X: Count the number of players from each sport in the List of Largest Sports Contracts (hint: see the value_counts() description above)

In [48]:
contracts['Sport'].value_counts()


Baseball                55
Basketball              26
American football       15
Auto racing              2
Association football     1
Hockey                   1
Name: Sport, dtype: int64

# API and requests

In [96]:
import requests

In [97]:
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'

In [98]:
resp = requests.get(url)
resp

<Response [200]>

In [99]:
data = resp.json()

In [105]:
data[0]['title']

'BUG: Some sas7bdat files with many columns are not parseable by read_sas'

In [106]:
issues = pd.DataFrame(data)
issues.head()

Unnamed: 0,assignee,assignees,author_association,body,closed_at,comments,comments_url,created_at,events_url,html_url,...,milestone,node_id,number,pull_request,repository_url,state,title,updated_at,url,user
0,,[],NONE,- [X] tests added / passed\r\n- [X] passes `gi...,,3,https://api.github.com/repos/pandas-dev/pandas...,2018-09-07T15:07:08Z,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/pull/22628,...,,MDExOlB1bGxSZXF1ZXN0MjEzOTU3NDg3,22628,{'url': 'https://api.github.com/repos/pandas-d...,https://api.github.com/repos/pandas-dev/pandas,open,BUG: Some sas7bdat files with many columns are...,2018-09-07T16:53:48Z,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'troels', 'id': 3203, 'node_id': 'MD..."
1,,[],NONE,"#### Code Sample, a copy-pastable example if p...",,1,https://api.github.com/repos/pandas-dev/pandas...,2018-09-07T13:10:56Z,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/issues/22627,...,,MDU6SXNzdWUzNTgwNjEyMzI=,22627,,https://api.github.com/repos/pandas-dev/pandas,open,Series.reorder_levels docstring includes extra...,2018-09-07T15:55:01Z,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'tschm', 'id': 2046079, 'node_id': '..."
2,,[],CONTRIBUTOR,Off the back of discussion in this [PR](https:...,,0,https://api.github.com/repos/pandas-dev/pandas...,2018-09-06T23:27:53Z,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/issues/22624,...,,MDU6SXNzdWUzNTc4NjUyODc=,22624,,https://api.github.com/repos/pandas-dev/pandas,open,Refactor test_sql.py,2018-09-06T23:38:20Z,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'alimcmaster1', 'id': 16733618, 'nod..."
3,,[],NONE,- [x] tests added / passed\r\n- [x] passes `gi...,,9,https://api.github.com/repos/pandas-dev/pandas...,2018-09-06T17:23:20Z,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/pull/22622,...,,MDExOlB1bGxSZXF1ZXN0MjEzNjkyMjU2,22622,{'url': 'https://api.github.com/repos/pandas-d...,https://api.github.com/repos/pandas-dev/pandas,open,Add DataFrame.corrmatrix() method,2018-09-07T13:15:31Z,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'Mottl', 'id': 4404433, 'node_id': '..."
4,,[],NONE,"#### Code Sample, a copy-pastable example if p...",,2,https://api.github.com/repos/pandas-dev/pandas...,2018-09-06T15:04:20Z,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/issues/22621,...,,MDU6SXNzdWUzNTc2OTkxMTA=,22621,,https://api.github.com/repos/pandas-dev/pandas,open,Dtype inconsistency when appending to empty da...,2018-09-07T16:51:44Z,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'jonathanrocher', 'id': 593945, 'nod..."


In [107]:
issues = pd.DataFrame(data, columns=['number', 'title','labels', 'state'])
issues.head()

Unnamed: 0,number,title,labels,state
0,22628,BUG: Some sas7bdat files with many columns are...,"[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...",open
1,22627,Series.reorder_levels docstring includes extra...,"[{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT...",open
2,22624,Refactor test_sql.py,"[{'id': 211029535, 'node_id': 'MDU6TGFiZWwyMTE...",open
3,22622,Add DataFrame.corrmatrix() method,"[{'id': 57296398, 'node_id': 'MDU6TGFiZWw1NzI5...",open
4,22621,Dtype inconsistency when appending to empty da...,"[{'id': 31404521, 'node_id': 'MDU6TGFiZWwzMTQw...",open
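
The labels column above still holds lists of dictionaries.  One possible way to flatten nested records like these is ```pd.json_normalize```, sketched below on hypothetical data shaped like the API response (the field values are invented, and only a few fields are included):

```python
import pandas as pd

# Hypothetical records shaped like the issue data above (values are made up)
data = [
    {'number': 1, 'title': 'First issue', 'state': 'open',
     'user': {'login': 'alice'}, 'labels': [{'name': 'Bug'}, {'name': 'IO'}]},
    {'number': 2, 'title': 'Second issue', 'state': 'open',
     'user': {'login': 'bob'}, 'labels': []},
]

# Flatten nested dicts: the 'user' dict becomes a 'user.login' column
flat = pd.json_normalize(data)

# Reduce each list of label dicts to a plain list of label names
flat['label_names'] = flat['labels'].apply(lambda labels: [label['name'] for label in labels])

print(flat[['number', 'title', 'user.login', 'label_names']])
```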


# Accessing databases

Some datasets are available only as SQLite databases, so it's useful to know how to load them into pandas.
The following dataset is a collection of about 500,000 fine food reviews from Amazon (https://www.kaggle.com/snap/amazon-fine-food-reviews/home).

In [108]:
import sqlite3

In [109]:
con = sqlite3.connect('data/fine_food_reviews.sqlite')

In [119]:
query = "SELECT Score, COUNT(*) FROM Reviews GROUP BY Score ORDER BY Score DESC"

In [120]:
cursor = con.execute(query)

In [121]:
rows = cursor.fetchall()

In [122]:
rows

[(5, 363122), (4, 80655), (3, 42640), (2, 29769), (1, 52268)]
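
The list of tuples above is not yet a DataFrame.  As a sketch of two ways to get one, the following builds a small in-memory SQLite database with a made-up Reviews table (so no data file is needed), then either wraps the fetched rows in ```pd.DataFrame``` or lets ```pd.read_sql_query``` run the query directly.  The table contents and the ```n``` column alias are assumptions for illustration:

```python
import sqlite3
import pandas as pd

# In-memory database with a made-up Reviews table, so no file is needed
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE Reviews (Id INTEGER PRIMARY KEY, Score INTEGER)')
con.executemany('INSERT INTO Reviews (Score) VALUES (?)', [(5,), (5,), (4,), (1,), (5,)])
con.commit()

query = 'SELECT Score, COUNT(*) AS n FROM Reviews GROUP BY Score ORDER BY Score DESC'

# Option 1: wrap the fetched tuples in a DataFrame and name the columns by hand
rows = con.execute(query).fetchall()
from_rows = pd.DataFrame(rows, columns=['Score', 'n'])

# Option 2: let pandas run the query and take column names from the result
from_sql = pd.read_sql_query(query, con)
con.close()

print(from_rows)
print(from_sql)
```

The second option is usually less error-prone because the column names come from the query itself.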